CURSO DE R PARA ANÁLISE DE DADOS

Tidymodels

Tidymodels é um conjunto de bibliotecas que cuida de todos os passos necessários para desenvolver o workflow de seleção e avaliação de modelos de aprendizado estatístico.

O desenvolvimento é financiado pela RStudio e liderado por Max Kuhn, o principal desenvolvedor de uma biblioteca similar mais antiga: caret.

A tidymodels é toda tidy friendly. Essa é uma das diferenças em relação à caret. Ela também é mais completa e possui muito mais funcionalidades.

É possível obter mais informações em tidymodels.org

Bibliotecas ortogonais

Tidymodels é formada por pacotes ortogonais.

Este termo é emprestado da matemática. No caso de dois vetores ortogonais, podemos nos mover na direção de um deles sem que nossa projeção no outro seja alterada.

Em programação ou arquitetura de software dizemos que componentes ortogonais são desacoplados: a mudança em um componente não afeta outros. Esta propriedade exige componentes menores e mais coesos, com responsabilidades bem definidas, e permite alterações com menos efeitos colaterais. Um bom livro para quem quer entender como usar conceitos como esse em programação se chama Pragmatic Programmer, de David Thomas e Andrew Hunt.

As bibliotecas que compõem a tidymodels funcionam assim: ao configurar o workflow que vai implementar o processo de treinamento, seleção e avaliação de modelos, várias etapas ortogonais vão ser preparadas com uso de várias bibliotecas.

Principais bibliotecas

Estudo de caso

Os dados vieram de um estudo de pesquisadores da Columbia Business School, Ray Fisman and Sheena Iyenga.

Eles fizeram várias rodadas de encontros de 4 minutos entre homens e mulheres heterossexuais.

Várias características foram coletadas, incluindo um veredito final determinando se cada parceiro de encinto gostou do outro.

Dados do estudo de caso

Os dados foram coletados no site Kaggle

Eles não estão redondos…

dados_speed_date <- read_csv("dados/speed/Speed Dating Data.csv")


glimpse(dados_speed_date)
## Rows: 8,378
## Columns: 195
## $ iid      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
## $ id       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
## $ gender   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ idg      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3...
## $ condtn   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ wave     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ round    <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10...
## $ position <dbl> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3...
## $ positin1 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ order    <dbl> 4, 3, 10, 5, 7, 6, 1, 2, 8, 9, 10, 9, 6, 1, 3, 2, 7, 8, 4,...
## $ partner  <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, ...
## $ pid      <dbl> 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 11, 12, 13, 14, 15...
## $ match    <dbl> 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0...
## $ int_corr <dbl> 0.14, 0.54, 0.16, 0.61, 0.21, 0.25, 0.34, 0.50, 0.28, -0.3...
## $ samerace <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1...
## $ age_o    <dbl> 27, 22, 22, 23, 24, 25, 30, 27, 28, 24, 27, 22, 22, 23, 24...
## $ race_o   <dbl> 2, 2, 4, 2, 3, 2, 2, 2, 2, 2, 2, 2, 4, 2, 3, 2, 2, 2, 2, 2...
## $ pf_o_att <dbl> 35.00, 60.00, 19.00, 30.00, 30.00, 50.00, 35.00, 33.33, 50...
## $ pf_o_sin <dbl> 20.00, 0.00, 18.00, 5.00, 10.00, 0.00, 15.00, 11.11, 0.00,...
## $ pf_o_int <dbl> 20.00, 0.00, 19.00, 15.00, 20.00, 30.00, 25.00, 11.11, 25....
## $ pf_o_fun <dbl> 20.00, 40.00, 18.00, 40.00, 10.00, 10.00, 10.00, 11.11, 10...
## $ pf_o_amb <dbl> 0.00, 0.00, 14.00, 5.00, 10.00, 0.00, 5.00, 11.11, 0.00, 0...
## $ pf_o_sha <dbl> 5.00, 0.00, 12.00, 5.00, 20.00, 10.00, 10.00, 22.22, 15.00...
## $ dec_o    <dbl> 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0...
## $ attr_o   <dbl> 6, 7, 10, 7, 8, 7, 3, 6, 7, 6, 8, 7, 10, 9, 10, 7, 5, 7, 8...
## $ sinc_o   <dbl> 8, 8, 10, 8, 7, 7, 6, 7, 7, 6, 7, 6, 10, 9, 10, 8, 3, 7, 6...
## $ intel_o  <dbl> 8, 10, 10, 9, 9, 8, 7, 5, 8, 6, 6, 10, 10, 9, 10, 7, 4, 7,...
## $ fun_o    <dbl> 8, 7, 10, 8, 6, 8, 5, 6, 8, 6, 9, 6, 10, 9, 10, 5, 3, 7, 9...
## $ amb_o    <dbl> 8, 7, 10, 9, 9, 7, 8, 8, 8, 6, 7, 6, 10, 9, 7, 7, 5, 7, 8,...
## $ shar_o   <dbl> 6, 5, 10, 8, 7, 7, 7, 6, 9, 6, 4, 5, 10, 9, 8, 7, 3, 5, 7,...
## $ like_o   <dbl> 7.0, 8.0, 10.0, 7.0, 8.0, 7.0, 2.0, 7.0, 6.5, 6.0, 7.0, 8....
## $ prob_o   <dbl> 4, 4, 10, 7, 6, 6, 1, 5, 8, 6, 2, 4, 10, 7, 1, 5, 3, 6, 8,...
## $ met_o    <dbl> 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2...
## $ age      <dbl> 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 24, 24, 24, 24, 24...
## $ field    <chr> "Law", "Law", "Law", "Law", "Law", "Law", "Law", "Law", "L...
## $ field_cd <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ undergra <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ mn_sat   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ tuition  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ race     <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
## $ imprace  <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
## $ imprelig <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5...
## $ from     <chr> "Chicago", "Chicago", "Chicago", "Chicago", "Chicago", "Ch...
## $ zipcode  <dbl> 60521, 60521, 60521, 60521, 60521, 60521, 60521, 60521, 60...
## $ income   <dbl> 69487, 69487, 69487, 69487, 69487, 69487, 69487, 69487, 69...
## $ goal     <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ date     <dbl> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5...
## $ go_out   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ career   <chr> "lawyer", "lawyer", "lawyer", "lawyer", "lawyer", "lawyer"...
## $ career_c <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ sports   <dbl> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3...
## $ tvsports <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
## $ exercise <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7...
## $ dining   <dbl> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, ...
## $ museums  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8...
## $ art      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6...
## $ hiking   <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3...
## $ gaming   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5...
## $ clubbing <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8...
## $ reading  <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 10, 10, 10, 10, 10, 10, 10, ...
## $ tv       <dbl> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ theater  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9...
## $ movies   <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 8, 8, 8, 8, 8, 8, ...
## $ concerts <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 7, 7, 7, 7, 7, 7, ...
## $ music    <dbl> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8...
## $ shopping <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3...
## $ yoga     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ exphappy <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4...
## $ expnum   <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5...
## $ attr1_1  <dbl> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 45, 45, 45, 45, 45...
## $ sinc1_1  <dbl> 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 5, 5, 5, 5, 5, 5, ...
## $ intel1_1 <dbl> 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 25, 25, 25, 25, 25...
## $ fun1_1   <dbl> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 20, 20, 20, 20, 20...
## $ amb1_1   <dbl> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 0, 0, 0, 0, 0, 0, ...
## $ shar1_1  <dbl> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 5, 5, 5, 5, 5, 5, ...
## $ attr4_1  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ sinc4_1  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ intel4_1 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ fun4_1   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ amb4_1   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ shar4_1  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ attr2_1  <dbl> 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 65, 65, 65, 65, 65...
## $ sinc2_1  <dbl> 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 0, 0, 0, 0, 0, 0, ...
## $ intel2_1 <dbl> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 10, 10, 10, 10, 10...
## $ fun2_1   <dbl> 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 25, 25, 25, 25, 25...
## $ amb2_1   <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ shar2_1  <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ attr3_1  <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7...
## $ sinc3_1  <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5...
## $ fun3_1   <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 10, 10, 10, 10, 10, 10, 10, ...
## $ intel3_1 <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8...
## $ amb3_1   <dbl> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3...
## $ attr5_1  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ sinc5_1  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ intel5_1 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ fun5_1   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ amb5_1   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ dec      <dbl> 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1...
## $ attr     <dbl> 6, 7, 5, 7, 5, 4, 7, 4, 7, 5, 5, 8, 5, 7, 6, 8, 7, 5, 7, 6...
## $ sinc     <dbl> 9, 8, 8, 6, 6, 9, 6, 9, 6, 6, 7, 5, 8, 9, 8, 7, 5, 8, 6, 7...
## $ intel    <dbl> 7, 7, 9, 8, 7, 7, 7, 7, 8, 6, 8, 6, 9, 7, 7, 8, 9, 7, 8, 8...
## $ fun      <dbl> 7, 8, 8, 7, 7, 4, 4, 6, 9, 8, 4, 6, 6, 6, 9, 3, 6, 5, 9, 7...
## $ amb      <dbl> 6, 5, 5, 6, 6, 6, 6, 5, 8, 10, 6, 9, 3, 5, 7, 6, 7, 9, 4, ...
## $ shar     <dbl> 5, 6, 7, 8, 6, 4, 7, 6, 8, 8, 3, 6, 4, 7, 8, 2, 9, 5, 5, 8...
## $ like     <dbl> 7, 7, 7, 7, 6, 6, 6, 6, 7, 6, 6, 7, 6, 7, 8, 6, 8, 5, 5, 8...
## $ prob     <dbl> 6, 5, NA, 6, 6, 5, 5, 7, 7, 6, 4, 3, 7, 8, 6, 5, 7, 6, 6, ...
## $ met      <dbl> 2, 1, 1, 2, 2, 2, 2, NA, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, ...
## $ match_es <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3...
## $ attr1_s  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ sinc1_s  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ intel1_s <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ fun1_s   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ amb1_s   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ shar1_s  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ attr3_s  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ sinc3_s  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ intel3_s <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ fun3_s   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ amb3_s   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ satis_2  <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5...
## $ length   <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
## $ numdat_2 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, NA, NA, NA, NA, NA, NA, NA, ...
## $ attr7_2  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ sinc7_2  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ intel7_2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ fun7_2   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ amb7_2   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ shar7_2  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ attr1_2  <dbl> 19.44, 19.44, 19.44, 19.44, 19.44, 19.44, 19.44, 19.44, 19...
## $ sinc1_2  <dbl> 16.67, 16.67, 16.67, 16.67, 16.67, 16.67, 16.67, 16.67, 16...
## $ intel1_2 <dbl> 13.89, 13.89, 13.89, 13.89, 13.89, 13.89, 13.89, 13.89, 13...
## $ fun1_2   <dbl> 22.22, 22.22, 22.22, 22.22, 22.22, 22.22, 22.22, 22.22, 22...
## $ amb1_2   <dbl> 11.11, 11.11, 11.11, 11.11, 11.11, 11.11, 11.11, 11.11, 11...
## $ shar1_2  <dbl> 16.67, 16.67, 16.67, 16.67, 16.67, 16.67, 16.67, 16.67, 16...
## $ attr4_2  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ sinc4_2  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ intel4_2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ fun4_2   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ amb4_2   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ shar4_2  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ attr2_2  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ sinc2_2  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ intel2_2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ fun2_2   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ amb2_2   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ shar2_2  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ attr3_2  <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7...
## $ sinc3_2  <dbl> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6...
## $ intel3_2 <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8...
## $ fun3_2   <dbl> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9...
## $ amb3_2   <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4...
## $ attr5_2  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ sinc5_2  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ intel5_2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ fun5_2   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ amb5_2   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ you_call <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ them_cal <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ date_3   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ numdat_3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ num_in_3 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ attr1_3  <dbl> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 30, 30, 30, 30, 30...
## $ sinc1_3  <dbl> 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 5, 5, 5, 5, 5, 5, ...
## $ intel1_3 <dbl> 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 40, 40, 40, 40, 40...
## $ fun1_3   <dbl> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15...
## $ amb1_3   <dbl> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 0, 0, 0, 0, 0, 0, ...
## $ shar1_3  <dbl> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 10, 10, 10, 10, 10...
## $ attr7_3  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ sinc7_3  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ intel7_3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ fun7_3   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ amb7_3   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ shar7_3  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ attr4_3  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ sinc4_3  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ intel4_3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ fun4_3   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ amb4_3   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ shar4_3  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ attr2_3  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ sinc2_3  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ intel2_3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ fun2_3   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ amb2_3   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ shar2_3  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ attr3_3  <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7...
## $ sinc3_3  <dbl> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6...
## $ intel3_3 <dbl> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9...
## $ fun3_3   <dbl> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9...
## $ amb3_3   <dbl> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4...
## $ attr5_3  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ sinc5_3  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ intel5_3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ fun5_3   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ amb5_3   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...

Renomeando coluna por coluna

Algumas colunas devem ser renomeadas para nomes mais inteligíveis

dados_speed_date_renomeado <- dados_speed_date %>% 
  rename(
    unique_id_number = iid,
    id_within_wave = id,
    male = gender,
    subject_within_gender = idg,
    choice = condtn,
    n_people_met_in_wave = round,
    position_meeting = position,
    position_started = positin1,
    order_meeting = order,
    partnet_id_within_wave = partner,
    partner_unique_id_number =pid ,
    interests_correlation = int_corr,
    same_race = samerace,
    my_age = age,
    partner_age = age_o,
    partner_race = race_o,
    partner_stated_pref_time0_attractive = pf_o_att,
    partner_stated_pref_time0_sincere = pf_o_sin,
    partner_stated_pref_time0_intelligent = pf_o_int,
    partner_stated_pref_time0_fun = pf_o_fun,
    partner_stated_pref_time0_ambitious = pf_o_amb,
    partner_stated_pref_time0_shared_interests = pf_o_sha,
    cod_field = field_cd,
    importance_same_race = imprace,
    importance_same_religion = imprelig,
    place_from = from,
    zipcode = zipcode,
    income_zipcode = income,
    frequency_date = date,
    frequency_go_out = go_out,
    career_macro = career_c,
    happy_expec = exphappy,
    n_expect_like_you = expnum,
    i_liked_partner = dec,
    partner_liked_me = dec_o,
    i_found_partner__attractive = attr,
    i_found_partner__sincere = sinc,
    i_found_partner__intelligent = intel,
    i_found_partner__fun = fun,
    i_found_partner__ambitious = amb,
    i_found_partner__interests = shar,
    degree_i_liked_partner = like,
    partner_found_me__attractive = attr_o,
    partner_found_me__sincere = sinc_o,
    partner_found_me__intelligent = intel_o,
    partner_found_me__fun = fun_o,
    partner_found_me__ambitious = amb_o,
    partner_found_me__interests = shar_o,
    probability_i_find_partner_liked_me = prob,
    met_before = met,
    n_matches_you_think = match_es,
    satisfaction_with_partners = satis_2,
    opinion_duration_of_date = length,
    opinion_num_dates = numdat_2,
    num_matches_you_called = you_call,
    num_matches_called_you = them_cal,
    have_you_dated = date_3
    
  )

glimpse(dados_speed_date_renomeado)
## Rows: 8,378
## Columns: 195
## $ unique_id_number                           <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ id_within_wave                             <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ male                                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ subject_within_gender                      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ choice                                     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ wave                                       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ n_people_met_in_wave                       <dbl> 10, 10, 10, 10, 10, 10, ...
## $ position_meeting                           <dbl> 7, 7, 7, 7, 7, 7, 7, 7, ...
## $ position_started                           <lgl> NA, NA, NA, NA, NA, NA, ...
## $ order_meeting                              <dbl> 4, 3, 10, 5, 7, 6, 1, 2,...
## $ partnet_id_within_wave                     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, ...
## $ partner_unique_id_number                   <dbl> 11, 12, 13, 14, 15, 16, ...
## $ match                                      <dbl> 0, 0, 1, 1, 1, 0, 0, 0, ...
## $ interests_correlation                      <dbl> 0.14, 0.54, 0.16, 0.61, ...
## $ same_race                                  <dbl> 0, 0, 1, 0, 0, 0, 0, 0, ...
## $ partner_age                                <dbl> 27, 22, 22, 23, 24, 25, ...
## $ partner_race                               <dbl> 2, 2, 4, 2, 3, 2, 2, 2, ...
## $ partner_stated_pref_time0_attractive       <dbl> 35.00, 60.00, 19.00, 30....
## $ partner_stated_pref_time0_sincere          <dbl> 20.00, 0.00, 18.00, 5.00...
## $ partner_stated_pref_time0_intelligent      <dbl> 20.00, 0.00, 19.00, 15.0...
## $ partner_stated_pref_time0_fun              <dbl> 20.00, 40.00, 18.00, 40....
## $ partner_stated_pref_time0_ambitious        <dbl> 0.00, 0.00, 14.00, 5.00,...
## $ partner_stated_pref_time0_shared_interests <dbl> 5.00, 0.00, 12.00, 5.00,...
## $ partner_liked_me                           <dbl> 0, 0, 1, 1, 1, 1, 0, 0, ...
## $ partner_found_me__attractive               <dbl> 6, 7, 10, 7, 8, 7, 3, 6,...
## $ partner_found_me__sincere                  <dbl> 8, 8, 10, 8, 7, 7, 6, 7,...
## $ partner_found_me__intelligent              <dbl> 8, 10, 10, 9, 9, 8, 7, 5...
## $ partner_found_me__fun                      <dbl> 8, 7, 10, 8, 6, 8, 5, 6,...
## $ partner_found_me__ambitious                <dbl> 8, 7, 10, 9, 9, 7, 8, 8,...
## $ partner_found_me__interests                <dbl> 6, 5, 10, 8, 7, 7, 7, 6,...
## $ like_o                                     <dbl> 7.0, 8.0, 10.0, 7.0, 8.0...
## $ prob_o                                     <dbl> 4, 4, 10, 7, 6, 6, 1, 5,...
## $ met_o                                      <dbl> 2, 2, 1, 2, 2, 2, 2, 2, ...
## $ my_age                                     <dbl> 21, 21, 21, 21, 21, 21, ...
## $ field                                      <chr> "Law", "Law", "Law", "La...
## $ cod_field                                  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ undergra                                   <lgl> NA, NA, NA, NA, NA, NA, ...
## $ mn_sat                                     <lgl> NA, NA, NA, NA, NA, NA, ...
## $ tuition                                    <lgl> NA, NA, NA, NA, NA, NA, ...
## $ race                                       <dbl> 4, 4, 4, 4, 4, 4, 4, 4, ...
## $ importance_same_race                       <dbl> 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ importance_same_religion                   <dbl> 4, 4, 4, 4, 4, 4, 4, 4, ...
## $ place_from                                 <chr> "Chicago", "Chicago", "C...
## $ zipcode                                    <dbl> 60521, 60521, 60521, 605...
## $ income_zipcode                             <dbl> 69487, 69487, 69487, 694...
## $ goal                                       <dbl> 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ frequency_date                             <dbl> 7, 7, 7, 7, 7, 7, 7, 7, ...
## $ frequency_go_out                           <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ career                                     <chr> "lawyer", "lawyer", "law...
## $ career_macro                               <dbl> NA, NA, NA, NA, NA, NA, ...
## $ sports                                     <dbl> 9, 9, 9, 9, 9, 9, 9, 9, ...
## $ tvsports                                   <dbl> 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ exercise                                   <dbl> 8, 8, 8, 8, 8, 8, 8, 8, ...
## $ dining                                     <dbl> 9, 9, 9, 9, 9, 9, 9, 9, ...
## $ museums                                    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ art                                        <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ hiking                                     <dbl> 5, 5, 5, 5, 5, 5, 5, 5, ...
## $ gaming                                     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ clubbing                                   <dbl> 5, 5, 5, 5, 5, 5, 5, 5, ...
## $ reading                                    <dbl> 6, 6, 6, 6, 6, 6, 6, 6, ...
## $ tv                                         <dbl> 9, 9, 9, 9, 9, 9, 9, 9, ...
## $ theater                                    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ movies                                     <dbl> 10, 10, 10, 10, 10, 10, ...
## $ concerts                                   <dbl> 10, 10, 10, 10, 10, 10, ...
## $ music                                      <dbl> 9, 9, 9, 9, 9, 9, 9, 9, ...
## $ shopping                                   <dbl> 8, 8, 8, 8, 8, 8, 8, 8, ...
## $ yoga                                       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ happy_expec                                <dbl> 3, 3, 3, 3, 3, 3, 3, 3, ...
## $ n_expect_like_you                          <dbl> 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ attr1_1                                    <dbl> 15, 15, 15, 15, 15, 15, ...
## $ sinc1_1                                    <dbl> 20, 20, 20, 20, 20, 20, ...
## $ intel1_1                                   <dbl> 20, 20, 20, 20, 20, 20, ...
## $ fun1_1                                     <dbl> 15, 15, 15, 15, 15, 15, ...
## $ amb1_1                                     <dbl> 15, 15, 15, 15, 15, 15, ...
## $ shar1_1                                    <dbl> 15, 15, 15, 15, 15, 15, ...
## $ attr4_1                                    <lgl> NA, NA, NA, NA, NA, NA, ...
## $ sinc4_1                                    <lgl> NA, NA, NA, NA, NA, NA, ...
## $ intel4_1                                   <lgl> NA, NA, NA, NA, NA, NA, ...
## $ fun4_1                                     <lgl> NA, NA, NA, NA, NA, NA, ...
## $ amb4_1                                     <lgl> NA, NA, NA, NA, NA, NA, ...
## $ shar4_1                                    <lgl> NA, NA, NA, NA, NA, NA, ...
## $ attr2_1                                    <dbl> 35, 35, 35, 35, 35, 35, ...
## $ sinc2_1                                    <dbl> 20, 20, 20, 20, 20, 20, ...
## $ intel2_1                                   <dbl> 15, 15, 15, 15, 15, 15, ...
## $ fun2_1                                     <dbl> 20, 20, 20, 20, 20, 20, ...
## $ amb2_1                                     <dbl> 5, 5, 5, 5, 5, 5, 5, 5, ...
## $ shar2_1                                    <dbl> 5, 5, 5, 5, 5, 5, 5, 5, ...
## $ attr3_1                                    <dbl> 6, 6, 6, 6, 6, 6, 6, 6, ...
## $ sinc3_1                                    <dbl> 8, 8, 8, 8, 8, 8, 8, 8, ...
## $ fun3_1                                     <dbl> 8, 8, 8, 8, 8, 8, 8, 8, ...
## $ intel3_1                                   <dbl> 8, 8, 8, 8, 8, 8, 8, 8, ...
## $ amb3_1                                     <dbl> 7, 7, 7, 7, 7, 7, 7, 7, ...
## $ attr5_1                                    <lgl> NA, NA, NA, NA, NA, NA, ...
## $ sinc5_1                                    <lgl> NA, NA, NA, NA, NA, NA, ...
## $ intel5_1                                   <lgl> NA, NA, NA, NA, NA, NA, ...
## $ fun5_1                                     <lgl> NA, NA, NA, NA, NA, NA, ...
## $ amb5_1                                     <lgl> NA, NA, NA, NA, NA, NA, ...
## $ i_liked_partner                            <dbl> 1, 1, 1, 1, 1, 0, 1, 0, ...
## $ i_found_partner__attractive                <dbl> 6, 7, 5, 7, 5, 4, 7, 4, ...
## $ i_found_partner__sincere                   <dbl> 9, 8, 8, 6, 6, 9, 6, 9, ...
## $ i_found_partner__intelligent               <dbl> 7, 7, 9, 8, 7, 7, 7, 7, ...
## $ i_found_partner__fun                       <dbl> 7, 8, 8, 7, 7, 4, 4, 6, ...
## $ i_found_partner__ambitious                 <dbl> 6, 5, 5, 6, 6, 6, 6, 5, ...
## $ i_found_partner__interests                 <dbl> 5, 6, 7, 8, 6, 4, 7, 6, ...
## $ degree_i_liked_partner                     <dbl> 7, 7, 7, 7, 6, 6, 6, 6, ...
## $ probability_i_find_partner_liked_me        <dbl> 6, 5, NA, 6, 6, 5, 5, 7,...
## $ met_before                                 <dbl> 2, 1, 1, 2, 2, 2, 2, NA,...
## $ n_matches_you_think                        <dbl> 4, 4, 4, 4, 4, 4, 4, 4, ...
## $ attr1_s                                    <lgl> NA, NA, NA, NA, NA, NA, ...
## $ sinc1_s                                    <lgl> NA, NA, NA, NA, NA, NA, ...
## $ intel1_s                                   <lgl> NA, NA, NA, NA, NA, NA, ...
## $ fun1_s                                     <lgl> NA, NA, NA, NA, NA, NA, ...
## $ amb1_s                                     <lgl> NA, NA, NA, NA, NA, NA, ...
## $ shar1_s                                    <lgl> NA, NA, NA, NA, NA, NA, ...
## $ attr3_s                                    <lgl> NA, NA, NA, NA, NA, NA, ...
## $ sinc3_s                                    <lgl> NA, NA, NA, NA, NA, NA, ...
## $ intel3_s                                   <lgl> NA, NA, NA, NA, NA, NA, ...
## $ fun3_s                                     <lgl> NA, NA, NA, NA, NA, NA, ...
## $ amb3_s                                     <lgl> NA, NA, NA, NA, NA, NA, ...
## $ satisfaction_with_partners                 <dbl> 6, 6, 6, 6, 6, 6, 6, 6, ...
## $ opinion_duration_of_date                   <dbl> 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ opinion_num_dates                          <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ attr7_2                                    <lgl> NA, NA, NA, NA, NA, NA, ...
## $ sinc7_2                                    <lgl> NA, NA, NA, NA, NA, NA, ...
## $ intel7_2                                   <lgl> NA, NA, NA, NA, NA, NA, ...
## $ fun7_2                                     <lgl> NA, NA, NA, NA, NA, NA, ...
## $ amb7_2                                     <lgl> NA, NA, NA, NA, NA, NA, ...
## $ shar7_2                                    <lgl> NA, NA, NA, NA, NA, NA, ...
## $ attr1_2                                    <dbl> 19.44, 19.44, 19.44, 19....
## $ sinc1_2                                    <dbl> 16.67, 16.67, 16.67, 16....
## $ intel1_2                                   <dbl> 13.89, 13.89, 13.89, 13....
## $ fun1_2                                     <dbl> 22.22, 22.22, 22.22, 22....
## $ amb1_2                                     <dbl> 11.11, 11.11, 11.11, 11....
## $ shar1_2                                    <dbl> 16.67, 16.67, 16.67, 16....
## $ attr4_2                                    <lgl> NA, NA, NA, NA, NA, NA, ...
## $ sinc4_2                                    <lgl> NA, NA, NA, NA, NA, NA, ...
## $ intel4_2                                   <lgl> NA, NA, NA, NA, NA, NA, ...
## $ fun4_2                                     <lgl> NA, NA, NA, NA, NA, NA, ...
## $ amb4_2                                     <lgl> NA, NA, NA, NA, NA, NA, ...
## $ shar4_2                                    <lgl> NA, NA, NA, NA, NA, NA, ...
## $ attr2_2                                    <lgl> NA, NA, NA, NA, NA, NA, ...
## $ sinc2_2                                    <lgl> NA, NA, NA, NA, NA, NA, ...
## $ intel2_2                                   <lgl> NA, NA, NA, NA, NA, NA, ...
## $ fun2_2                                     <lgl> NA, NA, NA, NA, NA, NA, ...
## $ amb2_2                                     <lgl> NA, NA, NA, NA, NA, NA, ...
## $ shar2_2                                    <lgl> NA, NA, NA, NA, NA, NA, ...
## $ attr3_2                                    <dbl> 6, 6, 6, 6, 6, 6, 6, 6, ...
## $ sinc3_2                                    <dbl> 7, 7, 7, 7, 7, 7, 7, 7, ...
## $ intel3_2                                   <dbl> 8, 8, 8, 8, 8, 8, 8, 8, ...
## $ fun3_2                                     <dbl> 7, 7, 7, 7, 7, 7, 7, 7, ...
## $ amb3_2                                     <dbl> 6, 6, 6, 6, 6, 6, 6, 6, ...
## $ attr5_2                                    <lgl> NA, NA, NA, NA, NA, NA, ...
## $ sinc5_2                                    <lgl> NA, NA, NA, NA, NA, NA, ...
## $ intel5_2                                   <lgl> NA, NA, NA, NA, NA, NA, ...
## $ fun5_2                                     <lgl> NA, NA, NA, NA, NA, NA, ...
## $ amb5_2                                     <lgl> NA, NA, NA, NA, NA, NA, ...
## $ num_matches_you_called                     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ num_matches_called_you                     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ have_you_dated                             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ numdat_3                                   <lgl> NA, NA, NA, NA, NA, NA, ...
## $ num_in_3                                   <dbl> NA, NA, NA, NA, NA, NA, ...
## $ attr1_3                                    <dbl> 15, 15, 15, 15, 15, 15, ...
## $ sinc1_3                                    <dbl> 20, 20, 20, 20, 20, 20, ...
## $ intel1_3                                   <dbl> 20, 20, 20, 20, 20, 20, ...
## $ fun1_3                                     <dbl> 15, 15, 15, 15, 15, 15, ...
## $ amb1_3                                     <dbl> 15, 15, 15, 15, 15, 15, ...
## $ shar1_3                                    <dbl> 15, 15, 15, 15, 15, 15, ...
## $ attr7_3                                    <lgl> NA, NA, NA, NA, NA, NA, ...
## $ sinc7_3                                    <lgl> NA, NA, NA, NA, NA, NA, ...
## $ intel7_3                                   <lgl> NA, NA, NA, NA, NA, NA, ...
## $ fun7_3                                     <lgl> NA, NA, NA, NA, NA, NA, ...
## $ amb7_3                                     <lgl> NA, NA, NA, NA, NA, NA, ...
## $ shar7_3                                    <lgl> NA, NA, NA, NA, NA, NA, ...
## $ attr4_3                                    <lgl> NA, NA, NA, NA, NA, NA, ...
## $ sinc4_3                                    <lgl> NA, NA, NA, NA, NA, NA, ...
## $ intel4_3                                   <lgl> NA, NA, NA, NA, NA, NA, ...
## $ fun4_3                                     <lgl> NA, NA, NA, NA, NA, NA, ...
## $ amb4_3                                     <lgl> NA, NA, NA, NA, NA, NA, ...
## $ shar4_3                                    <lgl> NA, NA, NA, NA, NA, NA, ...
## $ attr2_3                                    <lgl> NA, NA, NA, NA, NA, NA, ...
## $ sinc2_3                                    <lgl> NA, NA, NA, NA, NA, NA, ...
## $ intel2_3                                   <lgl> NA, NA, NA, NA, NA, NA, ...
## $ fun2_3                                     <lgl> NA, NA, NA, NA, NA, NA, ...
## $ amb2_3                                     <lgl> NA, NA, NA, NA, NA, NA, ...
## $ shar2_3                                    <lgl> NA, NA, NA, NA, NA, NA, ...
## $ attr3_3                                    <dbl> 5, 5, 5, 5, 5, 5, 5, 5, ...
## $ sinc3_3                                    <dbl> 7, 7, 7, 7, 7, 7, 7, 7, ...
## $ intel3_3                                   <dbl> 7, 7, 7, 7, 7, 7, 7, 7, ...
## $ fun3_3                                     <dbl> 7, 7, 7, 7, 7, 7, 7, 7, ...
## $ amb3_3                                     <dbl> 7, 7, 7, 7, 7, 7, 7, 7, ...
## $ attr5_3                                    <lgl> NA, NA, NA, NA, NA, NA, ...
## $ sinc5_3                                    <lgl> NA, NA, NA, NA, NA, NA, ...
## $ intel5_3                                   <lgl> NA, NA, NA, NA, NA, NA, ...
## $ fun5_3                                     <lgl> NA, NA, NA, NA, NA, NA, ...
## $ amb5_3                                     <lgl> NA, NA, NA, NA, NA, NA, ...

Renomeando conjuntos misteriosos em lote

Ainda há colunas com sufixos misteriosos, como 1_1

adjust_column_feature <- function(x, suffix, meaning ){
      

      suffix_removed <- str_remove(string = x, pattern = suffix)
      
      type <-  case_when(
        suffix_removed == "attr" ~ "attractive",
        suffix_removed == "sinc" ~ "sincere",
        suffix_removed == "intel" ~ "intelligent",
        suffix_removed == "fun" ~ "fun",
        suffix_removed == "amb" ~ "ambitious",
        suffix_removed == "shar" ~ "shared_interests"
      ) 

      str_glue("{meaning}_{type}")
      
}




dados_speed_date_rename_with <- dados_speed_date_renomeado %>% 
  rename_with(
    .cols = matches("^(?:attr|sinc|intel|fun|amb|shar)4_1"),
    .fn = ~adjust_column_feature(x = .x, suffix = "4_1", meaning = "competitors_look_for_")
  ) %>%
  rename_with(
    .cols = matches("^(?:attr|sinc|intel|fun|amb|shar)4_2"),
    .fn = ~adjust_column_feature(x = .x, suffix = "4_2", meaning = "competitors_look_for_follow_up_")
  ) %>%
  rename_with(
    .cols = matches("^(?:attr|sinc|intel|fun|amb|shar)4_3"),
    .fn = ~adjust_column_feature(x = .x, suffix = "4_3", meaning = "competitors_look_for_follow_up_weeks_")
  ) %>%
  rename_with(
    .cols = matches("^(?:attr|sinc|intel|fun|amb|shar)1_1"),
    .fn = ~adjust_column_feature(x = .x, suffix = "1_1", meaning = "you_look_for_")
  ) %>%
  rename_with(
    .cols = matches("^(?:attr|sinc|intel|fun|amb|shar)1_s"),
    .fn = ~adjust_column_feature(x = .x, suffix = "1_s", meaning = "you_look_for_half_way_")
  ) %>%
  rename_with(
    .cols = matches("^(?:attr|sinc|intel|fun|amb|shar)1_2"),
    .fn = ~adjust_column_feature(x = .x, suffix = "1_2", meaning = "you_look_for_follow_up_")
  ) %>%
  rename_with(
    .cols = matches("^(?:attr|sinc|intel|fun|amb|shar)1_3"),
    .fn = ~adjust_column_feature(x = .x, suffix = "1_3", meaning = "you_look_for_follow_up_weeks_")
  ) %>%
  rename_with(
    .cols = matches("^(?:attr|sinc|intel|fun|amb|shar)2_1"),
    .fn = ~adjust_column_feature(x = .x, suffix = "2_1", meaning = "you_think_opposite_sex_look_for_")
  ) %>%
  rename_with(
    .cols = matches("^(?:attr|sinc|intel|fun|amb|shar)2_2"),
    .fn = ~adjust_column_feature(x = .x, suffix = "2_2", meaning = "you_think_opposite_sex_look_for_follow_up_")
  ) %>%
  rename_with(
    .cols = matches("^(?:attr|sinc|intel|fun|amb|shar)2_3"),
    .fn = ~adjust_column_feature(x = .x, suffix = "2_3", meaning = "you_think_opposite_sex_look_for_follow_up_weeks_")
  ) %>%
  rename_with(
    .cols = matches("^(?:attr|sinc|intel|fun|amb|shar)5_1"),
    .fn = ~adjust_column_feature(x = .x, suffix = "5_1", meaning = "others_perceive_you_")
  ) %>%
  rename_with(
    .cols = matches("^(?:attr|sinc|intel|fun|amb|shar)5_2"),
    .fn = ~adjust_column_feature(x = .x, suffix = "5_2", meaning = "others_perceive_you_follow_up_")
  ) %>%
  rename_with(
    .cols = matches("^(?:attr|sinc|intel|fun|amb|shar)5_3"),
    .fn = ~adjust_column_feature(x = .x, suffix = "5_3", meaning = "others_perceive_you_follow_up_weeks_")
  ) %>%
  rename_with(
    .cols = matches("^(?:attr|sinc|intel|fun|amb|shar)3_1"),
    .fn = ~adjust_column_feature(x = .x, suffix = "3_1", meaning = "you_perceive_yourself_")
  ) %>%
  rename_with(
    .cols = matches("^(?:attr|sinc|intel|fun|amb|shar)3_2"),
    .fn = ~adjust_column_feature(x = .x, suffix = "3_2", meaning = "you_perceive_yourself_follow_up_")
  ) %>%
  rename_with(
    .cols = matches("^(?:attr|sinc|intel|fun|amb|shar)3_s"),
    .fn = ~adjust_column_feature(x = .x, suffix = "3_s", meaning = "you_perceive_yourself_half_way_")
  ) %>%
  rename_with(
    .cols = matches("^(?:attr|sinc|intel|fun|amb|shar)3_3"),
    .fn = ~adjust_column_feature(x = .x, suffix = "3_3", meaning = "you_perceive_yourself_follow_up_weeks_")
  ) %>%
  rename_with(
    .cols = matches("^(?:attr|sinc|intel|fun|amb|shar)7_2"),
    .fn = ~adjust_column_feature(x = .x, suffix = "7_2", meaning = "actual_importance_")
  ) %>%
  rename_with(
    .cols = matches("^(?:attr|sinc|intel|fun|amb|shar)7_3"),
    .fn = ~adjust_column_feature(x = .x, suffix = "7_3", meaning = "actual_importance_follow_up_weeks_")
  ) %>%
  select(  
    -c(
      undergra,
      mn_sat,
      tuition
    )
  )  

glimpse(dados_speed_date_rename_with)
## Rows: 8,378
## Columns: 192
## $ unique_id_number                                                  <dbl> 1...
## $ id_within_wave                                                    <dbl> 1...
## $ male                                                              <dbl> 0...
## $ subject_within_gender                                             <dbl> 1...
## $ choice                                                            <dbl> 1...
## $ wave                                                              <dbl> 1...
## $ n_people_met_in_wave                                              <dbl> 1...
## $ position_meeting                                                  <dbl> 7...
## $ position_started                                                  <lgl> N...
## $ order_meeting                                                     <dbl> 4...
## $ partnet_id_within_wave                                            <dbl> 1...
## $ partner_unique_id_number                                          <dbl> 1...
## $ match                                                             <dbl> 0...
## $ interests_correlation                                             <dbl> 0...
## $ same_race                                                         <dbl> 0...
## $ partner_age                                                       <dbl> 2...
## $ partner_race                                                      <dbl> 2...
## $ partner_stated_pref_time0_attractive                              <dbl> 3...
## $ partner_stated_pref_time0_sincere                                 <dbl> 2...
## $ partner_stated_pref_time0_intelligent                             <dbl> 2...
## $ partner_stated_pref_time0_fun                                     <dbl> 2...
## $ partner_stated_pref_time0_ambitious                               <dbl> 0...
## $ partner_stated_pref_time0_shared_interests                        <dbl> 5...
## $ partner_liked_me                                                  <dbl> 0...
## $ partner_found_me__attractive                                      <dbl> 6...
## $ partner_found_me__sincere                                         <dbl> 8...
## $ partner_found_me__intelligent                                     <dbl> 8...
## $ partner_found_me__fun                                             <dbl> 8...
## $ partner_found_me__ambitious                                       <dbl> 8...
## $ partner_found_me__interests                                       <dbl> 6...
## $ like_o                                                            <dbl> 7...
## $ prob_o                                                            <dbl> 4...
## $ met_o                                                             <dbl> 2...
## $ my_age                                                            <dbl> 2...
## $ field                                                             <chr> "...
## $ cod_field                                                         <dbl> 1...
## $ race                                                              <dbl> 4...
## $ importance_same_race                                              <dbl> 2...
## $ importance_same_religion                                          <dbl> 4...
## $ place_from                                                        <chr> "...
## $ zipcode                                                           <dbl> 6...
## $ income_zipcode                                                    <dbl> 6...
## $ goal                                                              <dbl> 2...
## $ frequency_date                                                    <dbl> 7...
## $ frequency_go_out                                                  <dbl> 1...
## $ career                                                            <chr> "...
## $ career_macro                                                      <dbl> N...
## $ sports                                                            <dbl> 9...
## $ tvsports                                                          <dbl> 2...
## $ exercise                                                          <dbl> 8...
## $ dining                                                            <dbl> 9...
## $ museums                                                           <dbl> 1...
## $ art                                                               <dbl> 1...
## $ hiking                                                            <dbl> 5...
## $ gaming                                                            <dbl> 1...
## $ clubbing                                                          <dbl> 5...
## $ reading                                                           <dbl> 6...
## $ tv                                                                <dbl> 9...
## $ theater                                                           <dbl> 1...
## $ movies                                                            <dbl> 1...
## $ concerts                                                          <dbl> 1...
## $ music                                                             <dbl> 9...
## $ shopping                                                          <dbl> 8...
## $ yoga                                                              <dbl> 1...
## $ happy_expec                                                       <dbl> 3...
## $ n_expect_like_you                                                 <dbl> 2...
## $ you_look_for__attractive                                          <dbl> 1...
## $ you_look_for__sincere                                             <dbl> 2...
## $ you_look_for__intelligent                                         <dbl> 2...
## $ you_look_for__fun                                                 <dbl> 1...
## $ you_look_for__ambitious                                           <dbl> 1...
## $ you_look_for__shared_interests                                    <dbl> 1...
## $ competitors_look_for__attractive                                  <lgl> N...
## $ competitors_look_for__sincere                                     <lgl> N...
## $ competitors_look_for__intelligent                                 <lgl> N...
## $ competitors_look_for__fun                                         <lgl> N...
## $ competitors_look_for__ambitious                                   <lgl> N...
## $ competitors_look_for__shared_interests                            <lgl> N...
## $ you_think_opposite_sex_look_for__attractive                       <dbl> 3...
## $ you_think_opposite_sex_look_for__sincere                          <dbl> 2...
## $ you_think_opposite_sex_look_for__intelligent                      <dbl> 1...
## $ you_think_opposite_sex_look_for__fun                              <dbl> 2...
## $ you_think_opposite_sex_look_for__ambitious                        <dbl> 5...
## $ you_think_opposite_sex_look_for__shared_interests                 <dbl> 5...
## $ you_perceive_yourself__attractive                                 <dbl> 6...
## $ you_perceive_yourself__sincere                                    <dbl> 8...
## $ you_perceive_yourself__fun                                        <dbl> 8...
## $ you_perceive_yourself__intelligent                                <dbl> 8...
## $ you_perceive_yourself__ambitious                                  <dbl> 7...
## $ others_perceive_you__attractive                                   <lgl> N...
## $ others_perceive_you__sincere                                      <lgl> N...
## $ others_perceive_you__intelligent                                  <lgl> N...
## $ others_perceive_you__fun                                          <lgl> N...
## $ others_perceive_you__ambitious                                    <lgl> N...
## $ i_liked_partner                                                   <dbl> 1...
## $ i_found_partner__attractive                                       <dbl> 6...
## $ i_found_partner__sincere                                          <dbl> 9...
## $ i_found_partner__intelligent                                      <dbl> 7...
## $ i_found_partner__fun                                              <dbl> 7...
## $ i_found_partner__ambitious                                        <dbl> 6...
## $ i_found_partner__interests                                        <dbl> 5...
## $ degree_i_liked_partner                                            <dbl> 7...
## $ probability_i_find_partner_liked_me                               <dbl> 6...
## $ met_before                                                        <dbl> 2...
## $ n_matches_you_think                                               <dbl> 4...
## $ you_look_for_half_way__attractive                                 <lgl> N...
## $ you_look_for_half_way__sincere                                    <lgl> N...
## $ you_look_for_half_way__intelligent                                <lgl> N...
## $ you_look_for_half_way__fun                                        <lgl> N...
## $ you_look_for_half_way__ambitious                                  <lgl> N...
## $ you_look_for_half_way__shared_interests                           <lgl> N...
## $ you_perceive_yourself_half_way__attractive                        <lgl> N...
## $ you_perceive_yourself_half_way__sincere                           <lgl> N...
## $ you_perceive_yourself_half_way__intelligent                       <lgl> N...
## $ you_perceive_yourself_half_way__fun                               <lgl> N...
## $ you_perceive_yourself_half_way__ambitious                         <lgl> N...
## $ satisfaction_with_partners                                        <dbl> 6...
## $ opinion_duration_of_date                                          <dbl> 2...
## $ opinion_num_dates                                                 <dbl> 1...
## $ actual_importance__attractive                                     <lgl> N...
## $ actual_importance__sincere                                        <lgl> N...
## $ actual_importance__intelligent                                    <lgl> N...
## $ actual_importance__fun                                            <lgl> N...
## $ actual_importance__ambitious                                      <lgl> N...
## $ actual_importance__shared_interests                               <lgl> N...
## $ you_look_for_follow_up__attractive                                <dbl> 1...
## $ you_look_for_follow_up__sincere                                   <dbl> 1...
## $ you_look_for_follow_up__intelligent                               <dbl> 1...
## $ you_look_for_follow_up__fun                                       <dbl> 2...
## $ you_look_for_follow_up__ambitious                                 <dbl> 1...
## $ you_look_for_follow_up__shared_interests                          <dbl> 1...
## $ competitors_look_for_follow_up__attractive                        <lgl> N...
## $ competitors_look_for_follow_up__sincere                           <lgl> N...
## $ competitors_look_for_follow_up__intelligent                       <lgl> N...
## $ competitors_look_for_follow_up__fun                               <lgl> N...
## $ competitors_look_for_follow_up__ambitious                         <lgl> N...
## $ competitors_look_for_follow_up__shared_interests                  <lgl> N...
## $ you_think_opposite_sex_look_for_follow_up__attractive             <lgl> N...
## $ you_think_opposite_sex_look_for_follow_up__sincere                <lgl> N...
## $ you_think_opposite_sex_look_for_follow_up__intelligent            <lgl> N...
## $ you_think_opposite_sex_look_for_follow_up__fun                    <lgl> N...
## $ you_think_opposite_sex_look_for_follow_up__ambitious              <lgl> N...
## $ you_think_opposite_sex_look_for_follow_up__shared_interests       <lgl> N...
## $ you_perceive_yourself_follow_up__attractive                       <dbl> 6...
## $ you_perceive_yourself_follow_up__sincere                          <dbl> 7...
## $ you_perceive_yourself_follow_up__intelligent                      <dbl> 8...
## $ you_perceive_yourself_follow_up__fun                              <dbl> 7...
## $ you_perceive_yourself_follow_up__ambitious                        <dbl> 6...
## $ others_perceive_you_follow_up__attractive                         <lgl> N...
## $ others_perceive_you_follow_up__sincere                            <lgl> N...
## $ others_perceive_you_follow_up__intelligent                        <lgl> N...
## $ others_perceive_you_follow_up__fun                                <lgl> N...
## $ others_perceive_you_follow_up__ambitious                          <lgl> N...
## $ num_matches_you_called                                            <dbl> 1...
## $ num_matches_called_you                                            <dbl> 1...
## $ have_you_dated                                                    <dbl> 0...
## $ numdat_3                                                          <lgl> N...
## $ num_in_3                                                          <dbl> N...
## $ you_look_for_follow_up_weeks__attractive                          <dbl> 1...
## $ you_look_for_follow_up_weeks__sincere                             <dbl> 2...
## $ you_look_for_follow_up_weeks__intelligent                         <dbl> 2...
## $ you_look_for_follow_up_weeks__fun                                 <dbl> 1...
## $ you_look_for_follow_up_weeks__ambitious                           <dbl> 1...
## $ you_look_for_follow_up_weeks__shared_interests                    <dbl> 1...
## $ actual_importance_follow_up_weeks__attractive                     <lgl> N...
## $ actual_importance_follow_up_weeks__sincere                        <lgl> N...
## $ actual_importance_follow_up_weeks__intelligent                    <lgl> N...
## $ actual_importance_follow_up_weeks__fun                            <lgl> N...
## $ actual_importance_follow_up_weeks__ambitious                      <lgl> N...
## $ actual_importance_follow_up_weeks__shared_interests               <lgl> N...
## $ competitors_look_for_follow_up_weeks__attractive                  <lgl> N...
## $ competitors_look_for_follow_up_weeks__sincere                     <lgl> N...
## $ competitors_look_for_follow_up_weeks__intelligent                 <lgl> N...
## $ competitors_look_for_follow_up_weeks__fun                         <lgl> N...
## $ competitors_look_for_follow_up_weeks__ambitious                   <lgl> N...
## $ competitors_look_for_follow_up_weeks__shared_interests            <lgl> N...
## $ you_think_opposite_sex_look_for_follow_up_weeks__attractive       <lgl> N...
## $ you_think_opposite_sex_look_for_follow_up_weeks__sincere          <lgl> N...
## $ you_think_opposite_sex_look_for_follow_up_weeks__intelligent      <lgl> N...
## $ you_think_opposite_sex_look_for_follow_up_weeks__fun              <lgl> N...
## $ you_think_opposite_sex_look_for_follow_up_weeks__ambitious        <lgl> N...
## $ you_think_opposite_sex_look_for_follow_up_weeks__shared_interests <lgl> N...
## $ you_perceive_yourself_follow_up_weeks__attractive                 <dbl> 5...
## $ you_perceive_yourself_follow_up_weeks__sincere                    <dbl> 7...
## $ you_perceive_yourself_follow_up_weeks__intelligent                <dbl> 7...
## $ you_perceive_yourself_follow_up_weeks__fun                        <dbl> 7...
## $ you_perceive_yourself_follow_up_weeks__ambitious                  <dbl> 7...
## $ others_perceive_you_follow_up_weeks__attractive                   <lgl> N...
## $ others_perceive_you_follow_up_weeks__sincere                      <lgl> N...
## $ others_perceive_you_follow_up_weeks__intelligent                  <lgl> N...
## $ others_perceive_you_follow_up_weeks__fun                          <lgl> N...
## $ others_perceive_you_follow_up_weeks__ambitious                    <lgl> N...

Traduzindo códigos para fatores com strings

Muitos atributos estão codificados numericamente, o que atrapalha a interpretação, eles foram transformados em vetores de caracteres.

Essa codificação numérica é muito comum em produtos de análise estatística de prateleira, que possibilitam point-and-click.

Reparem que o atributo frequency_date foi transformado em um fator onde os levels tem uma ordem espcífica. Isso trará implicações posteriores.

  dados_speed_date_fatores <- dados_speed_date_rename_with %>%  mutate(
    choice = if_else(choice == 1, "limited", "extensive") ,
    field_factor = case_when(
      cod_field == 1 ~ "Law",
      cod_field == 2 ~ "Math",
      cod_field == 3 ~ "Social Science, Psychologist" ,
      cod_field == 4 ~ "Medical Science, Pharmaceuticals, and Bio Tech" ,
      cod_field == 5 ~ "Engineering"  ,
      cod_field == 6 ~ "English/Creative Writing/ Journalism" ,
      cod_field == 7 ~ "History/Religion/Philosophy" ,
      cod_field == 8 ~ "Business/Econ/Finance" ,
      cod_field == 9 ~ "Education, Academia" ,
      cod_field == 10 ~ "Biological Sciences/Chemistry/Physics",
      cod_field == 11 ~ "Social Work" ,
      cod_field == 12 ~ "Undergrad/undecided" ,
      cod_field == 13 ~ "Political Science/International Affairs" ,
      cod_field == 14 ~ "Film",
      cod_field == 15 ~ "Fine Arts/Arts Administration",
      cod_field == 16 ~ "Languages",
      cod_field == 17 ~ "Architecture",
      cod_field == 18 ~ "Other"
    ),
    
    race = case_when(
      race == 1 ~ "Black",
      race == 2 ~ "White",
      race == 3 ~ "Latino", 
      race == 4 ~ "Asian" ,
      race == 5 ~ "Native American"  ,
      race == 6 ~ "Others" 
    ),
    
    partner_race = case_when(
      partner_race == 1 ~ "Black",
      partner_race == 2 ~ "White",
      partner_race == 3 ~ "Latino", 
      partner_race == 4 ~ "Asian" ,
      partner_race == 5 ~ "Native American"  ,
      partner_race == 6 ~ "Others" 
    ),
    
    goal = case_when(
      goal == 1 ~ "Fun",
      goal == 2 ~ "Meet new people",
      goal == 3 ~ "Date",
      goal == 4 ~ "Serious",
      goal == 5 ~ "To say",
      goal == 6 ~ "Other"
    ),
    
    cod_frequency_date = frequency_date
    
    ,

    frequency_date = 
      case_when(
        frequency_date == 1 ~ "Several a week",
        frequency_date == 2 ~ "Twice a week",
        frequency_date == 3 ~ "Once a week",
        frequency_date == 4 ~ "Twice a month",
        frequency_date == 5 ~ "Once a month",
        frequency_date == 6 ~ "Several a year",
        frequency_date == 7 ~ "Never"
      ) %>% 
      factor(
        level = c(
          "Several a week",
          "Twice a week",
          "Once a week",
          "Twice a month",
          "Once a month",
          "Several a year",
          "Never"
        ),
        ordered = TRUE
      ) 
    ,
    
    
    frequency_go_out = 
      case_when(
        frequency_go_out == 1 ~ "Several a week",
        frequency_go_out   == 2 ~ "Twice a week",
        frequency_date == 3 ~ "Once a week",
        frequency_date == 4 ~ "Twice a month",
        frequency_date == 5 ~ "Once a month",
        frequency_date == 6 ~ "Several a year",
        frequency_date == 7 ~ "Never"
      ) %>% 
      factor(
        level = c(
          "Several a week",
          "Twice a week",
          "Once a week",
          "Twice a month",
          "Once a month",
          "Several a year",
          "Never"
        ),
        ordered = TRUE
      ) ,

    
    
    career = str_to_title(career),

    career_macro = 
      case_when(
        career_macro == 1 ~ "Lawyer",
        career_macro == 2 ~ "Academic/Research",
        career_macro == 3 ~ "Psychologist" ,
        career_macro == 4 ~ "Doctor/Medicine" ,
        career_macro == 5 ~ "Engineer" ,
        career_macro == 6 ~ "Creative Arts/Entertainment" ,
        career_macro == 7 ~ "Banking/Consulting/Finance/Marketing/Business/CEO/Entrepreneur/Admin" ,
        career_macro == 8 ~ "Real Estate" ,
        career_macro == 9 ~ "International/Humanitarian Affairs" ,
        career_macro == 10 ~ "Undecided" ,
        career_macro == 11 ~ "Social Work",
        career_macro == 12 ~ "Speech Pathology",
        career_macro == 13 ~ "Politics",
        career_macro == 14 ~ "Pro sports/Athletics",
        career_macro == 15 ~ "Other",
        career_macro == 16 ~ "Journalism",
        career_macro == 17 ~ "Architecture"
    ),
    
    met_before = if_else(met_before == 1, TRUE, FALSE),
    
    opinion_duration_of_date = case_when(
      opinion_duration_of_date == 1 ~ "Too little",
      opinion_duration_of_date == 2 ~ "Too much",
      opinion_duration_of_date == 3 ~ "Just Right",
    ),
    
    opinion_num_dates = case_when(
      opinion_num_dates == 1 ~ "Too few",
      opinion_num_dates == 2 ~ "Too many"
    ),
    
    have_you_dated = case_when(
      have_you_dated == 1 ~ TRUE,
      have_you_dated == 2 ~ FALSE
    )
    ,
    sex = if_else(male > 0, "Homem", "Mulher") %>%  as_factor()
  ) %>% 
  select(
    match,
    unique_id_number,
    id_within_wave,
    sex,
    subject_within_gender,
    choice,
    n_people_met_in_wave,
    position_meeting,
    position_started,
    order_meeting,
    partnet_id_within_wave,
    partner_unique_id_number,
    interests_correlation,
    same_race,
    my_age,
    partner_age,
    partner_race,
    partner_stated_pref_time0_attractive,
    partner_stated_pref_time0_sincere,
    partner_stated_pref_time0_intelligent,
    partner_stated_pref_time0_fun,
    partner_stated_pref_time0_ambitious,
    partner_stated_pref_time0_shared_interests,
    importance_same_race,
    importance_same_religion,
    income_zipcode,
    frequency_date,
    frequency_go_out,
    career_macro,
    happy_expec,
    n_expect_like_you,
    partner_liked_me,
    i_liked_partner,
    i_found_partner__attractive,
    i_found_partner__sincere,
    i_found_partner__intelligent,
    i_found_partner__fun,
    i_found_partner__ambitious,
    i_found_partner__interests,
    partner_found_me__attractive,
    partner_found_me__sincere,
    partner_found_me__intelligent,
    partner_found_me__fun,
    partner_found_me__ambitious,
    partner_found_me__interests,
    probability_i_find_partner_liked_me,
    met_before,
    opinion_duration_of_date,
    opinion_num_dates,
    starts_with("competitors_look_for__"),
    starts_with("you_look_for__"),
    starts_with("opposite_sex_look_for__"),
    starts_with("others_perceive_you__"),
    starts_with("you_perceive_yourself__"),
    starts_with("actual_importance__"),
    choice,
    race,
    goal,
    frequency_date,
    career_macro,
    met_before,
    opinion_duration_of_date,
    opinion_num_dates,
  ) %>% 
  mutate(
    across(
      .cols = where(is.character),
      .fns = as.factor
    )
  ) %>% 
  mutate(
    across(
      .cols = c(match, same_race, partner_liked_me, i_liked_partner) ,
      .fns = as.logical
    )
  ) 


glimpse(dados_speed_date_fatores)
## Rows: 8,378
## Columns: 79
## $ match                                      <lgl> FALSE, FALSE, TRUE, TRUE...
## $ unique_id_number                           <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ id_within_wave                             <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ sex                                        <fct> Mulher, Mulher, Mulher, ...
## $ subject_within_gender                      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ choice                                     <fct> limited, limited, limite...
## $ n_people_met_in_wave                       <dbl> 10, 10, 10, 10, 10, 10, ...
## $ position_meeting                           <dbl> 7, 7, 7, 7, 7, 7, 7, 7, ...
## $ position_started                           <lgl> NA, NA, NA, NA, NA, NA, ...
## $ order_meeting                              <dbl> 4, 3, 10, 5, 7, 6, 1, 2,...
## $ partnet_id_within_wave                     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, ...
## $ partner_unique_id_number                   <dbl> 11, 12, 13, 14, 15, 16, ...
## $ interests_correlation                      <dbl> 0.14, 0.54, 0.16, 0.61, ...
## $ same_race                                  <lgl> FALSE, FALSE, TRUE, FALS...
## $ my_age                                     <dbl> 21, 21, 21, 21, 21, 21, ...
## $ partner_age                                <dbl> 27, 22, 22, 23, 24, 25, ...
## $ partner_race                               <fct> White, White, Asian, Whi...
## $ partner_stated_pref_time0_attractive       <dbl> 35.00, 60.00, 19.00, 30....
## $ partner_stated_pref_time0_sincere          <dbl> 20.00, 0.00, 18.00, 5.00...
## $ partner_stated_pref_time0_intelligent      <dbl> 20.00, 0.00, 19.00, 15.0...
## $ partner_stated_pref_time0_fun              <dbl> 20.00, 40.00, 18.00, 40....
## $ partner_stated_pref_time0_ambitious        <dbl> 0.00, 0.00, 14.00, 5.00,...
## $ partner_stated_pref_time0_shared_interests <dbl> 5.00, 0.00, 12.00, 5.00,...
## $ importance_same_race                       <dbl> 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ importance_same_religion                   <dbl> 4, 4, 4, 4, 4, 4, 4, 4, ...
## $ income_zipcode                             <dbl> 69487, 69487, 69487, 694...
## $ frequency_date                             <ord> Never, Never, Never, Nev...
## $ frequency_go_out                           <ord> Several a week, Several ...
## $ career_macro                               <fct> NA, NA, NA, NA, NA, NA, ...
## $ happy_expec                                <dbl> 3, 3, 3, 3, 3, 3, 3, 3, ...
## $ n_expect_like_you                          <dbl> 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ partner_liked_me                           <lgl> FALSE, FALSE, TRUE, TRUE...
## $ i_liked_partner                            <lgl> TRUE, TRUE, TRUE, TRUE, ...
## $ i_found_partner__attractive                <dbl> 6, 7, 5, 7, 5, 4, 7, 4, ...
## $ i_found_partner__sincere                   <dbl> 9, 8, 8, 6, 6, 9, 6, 9, ...
## $ i_found_partner__intelligent               <dbl> 7, 7, 9, 8, 7, 7, 7, 7, ...
## $ i_found_partner__fun                       <dbl> 7, 8, 8, 7, 7, 4, 4, 6, ...
## $ i_found_partner__ambitious                 <dbl> 6, 5, 5, 6, 6, 6, 6, 5, ...
## $ i_found_partner__interests                 <dbl> 5, 6, 7, 8, 6, 4, 7, 6, ...
## $ partner_found_me__attractive               <dbl> 6, 7, 10, 7, 8, 7, 3, 6,...
## $ partner_found_me__sincere                  <dbl> 8, 8, 10, 8, 7, 7, 6, 7,...
## $ partner_found_me__intelligent              <dbl> 8, 10, 10, 9, 9, 8, 7, 5...
## $ partner_found_me__fun                      <dbl> 8, 7, 10, 8, 6, 8, 5, 6,...
## $ partner_found_me__ambitious                <dbl> 8, 7, 10, 9, 9, 7, 8, 8,...
## $ partner_found_me__interests                <dbl> 6, 5, 10, 8, 7, 7, 7, 6,...
## $ probability_i_find_partner_liked_me        <dbl> 6, 5, NA, 6, 6, 5, 5, 7,...
## $ met_before                                 <lgl> FALSE, TRUE, TRUE, FALSE...
## $ opinion_duration_of_date                   <fct> Too much, Too much, Too ...
## $ opinion_num_dates                          <fct> Too few, Too few, Too fe...
## $ competitors_look_for__attractive           <lgl> NA, NA, NA, NA, NA, NA, ...
## $ competitors_look_for__sincere              <lgl> NA, NA, NA, NA, NA, NA, ...
## $ competitors_look_for__intelligent          <lgl> NA, NA, NA, NA, NA, NA, ...
## $ competitors_look_for__fun                  <lgl> NA, NA, NA, NA, NA, NA, ...
## $ competitors_look_for__ambitious            <lgl> NA, NA, NA, NA, NA, NA, ...
## $ competitors_look_for__shared_interests     <lgl> NA, NA, NA, NA, NA, NA, ...
## $ you_look_for__attractive                   <dbl> 15, 15, 15, 15, 15, 15, ...
## $ you_look_for__sincere                      <dbl> 20, 20, 20, 20, 20, 20, ...
## $ you_look_for__intelligent                  <dbl> 20, 20, 20, 20, 20, 20, ...
## $ you_look_for__fun                          <dbl> 15, 15, 15, 15, 15, 15, ...
## $ you_look_for__ambitious                    <dbl> 15, 15, 15, 15, 15, 15, ...
## $ you_look_for__shared_interests             <dbl> 15, 15, 15, 15, 15, 15, ...
## $ others_perceive_you__attractive            <lgl> NA, NA, NA, NA, NA, NA, ...
## $ others_perceive_you__sincere               <lgl> NA, NA, NA, NA, NA, NA, ...
## $ others_perceive_you__intelligent           <lgl> NA, NA, NA, NA, NA, NA, ...
## $ others_perceive_you__fun                   <lgl> NA, NA, NA, NA, NA, NA, ...
## $ others_perceive_you__ambitious             <lgl> NA, NA, NA, NA, NA, NA, ...
## $ you_perceive_yourself__attractive          <dbl> 6, 6, 6, 6, 6, 6, 6, 6, ...
## $ you_perceive_yourself__sincere             <dbl> 8, 8, 8, 8, 8, 8, 8, 8, ...
## $ you_perceive_yourself__fun                 <dbl> 8, 8, 8, 8, 8, 8, 8, 8, ...
## $ you_perceive_yourself__intelligent         <dbl> 8, 8, 8, 8, 8, 8, 8, 8, ...
## $ you_perceive_yourself__ambitious           <dbl> 7, 7, 7, 7, 7, 7, 7, 7, ...
## $ actual_importance__attractive              <lgl> NA, NA, NA, NA, NA, NA, ...
## $ actual_importance__sincere                 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ actual_importance__intelligent             <lgl> NA, NA, NA, NA, NA, NA, ...
## $ actual_importance__fun                     <lgl> NA, NA, NA, NA, NA, NA, ...
## $ actual_importance__ambitious               <lgl> NA, NA, NA, NA, NA, NA, ...
## $ actual_importance__shared_interests        <lgl> NA, NA, NA, NA, NA, NA, ...
## $ race                                       <fct> Asian, Asian, Asian, Asi...
## $ goal                                       <fct> Meet new people, Meet ne...

Compatibilizando normalizações diferentes

Algumas perguntas foram feitas de forma inconsistente ao longo dos dias da pesquisa.

Em alguns dias foi dado um orçamento de x pontos para os entrevistados distribuírem nos atributos de mesmo tipo, em outros foi dado um orçamento pra cada atributo.

normaliza_no_prefixo <-  function(
  df = dados_com_representacao , 
  prefixo = "partner_stated_pref_time0_" ){
  

  dados_speed_date_normalizada <- df %>%
    rowwise() %>% 
    mutate(
      "{prefixo}_soma" := 
        sum(c_across(starts_with(prefixo)), na.rm = TRUE)
    ) %>% 
    mutate(
      across(
        .cols = starts_with(prefixo),
        .fns = ~.x / .data[[str_glue("{prefixo}_soma")]]
      )
    ) %>% 
    select(
      -contains(str_glue("{prefixo}_soma"))
    ) 
    
}

dados_speed_date_normalizada <- dados_speed_date_fatores %>%  
  normaliza_no_prefixo("partner_stated_pref_time0_" ) %>% 
  normaliza_no_prefixo("you_look_for__" ) %>% 
  normaliza_no_prefixo("opposite_sex_look_for__" ) %>% 
  ungroup()


glimpse(dados_speed_date_normalizada)
## Rows: 8,378
## Columns: 79
## $ match                                      <lgl> FALSE, FALSE, TRUE, TRUE...
## $ unique_id_number                           <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ id_within_wave                             <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ sex                                        <fct> Mulher, Mulher, Mulher, ...
## $ subject_within_gender                      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ choice                                     <fct> limited, limited, limite...
## $ n_people_met_in_wave                       <dbl> 10, 10, 10, 10, 10, 10, ...
## $ position_meeting                           <dbl> 7, 7, 7, 7, 7, 7, 7, 7, ...
## $ position_started                           <lgl> NA, NA, NA, NA, NA, NA, ...
## $ order_meeting                              <dbl> 4, 3, 10, 5, 7, 6, 1, 2,...
## $ partnet_id_within_wave                     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, ...
## $ partner_unique_id_number                   <dbl> 11, 12, 13, 14, 15, 16, ...
## $ interests_correlation                      <dbl> 0.14, 0.54, 0.16, 0.61, ...
## $ same_race                                  <lgl> FALSE, FALSE, TRUE, FALS...
## $ my_age                                     <dbl> 21, 21, 21, 21, 21, 21, ...
## $ partner_age                                <dbl> 27, 22, 22, 23, 24, 25, ...
## $ partner_race                               <fct> White, White, Asian, Whi...
## $ partner_stated_pref_time0_attractive       <dbl> 0.3500000, 0.6000000, 0....
## $ partner_stated_pref_time0_sincere          <dbl> 0.2000000, 0.0000000, 0....
## $ partner_stated_pref_time0_intelligent      <dbl> 0.2000000, 0.0000000, 0....
## $ partner_stated_pref_time0_fun              <dbl> 0.2000000, 0.4000000, 0....
## $ partner_stated_pref_time0_ambitious        <dbl> 0.0000000, 0.0000000, 0....
## $ partner_stated_pref_time0_shared_interests <dbl> 0.0500000, 0.0000000, 0....
## $ importance_same_race                       <dbl> 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ importance_same_religion                   <dbl> 4, 4, 4, 4, 4, 4, 4, 4, ...
## $ income_zipcode                             <dbl> 69487, 69487, 69487, 694...
## $ frequency_date                             <ord> Never, Never, Never, Nev...
## $ frequency_go_out                           <ord> Several a week, Several ...
## $ career_macro                               <fct> NA, NA, NA, NA, NA, NA, ...
## $ happy_expec                                <dbl> 3, 3, 3, 3, 3, 3, 3, 3, ...
## $ n_expect_like_you                          <dbl> 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ partner_liked_me                           <lgl> FALSE, FALSE, TRUE, TRUE...
## $ i_liked_partner                            <lgl> TRUE, TRUE, TRUE, TRUE, ...
## $ i_found_partner__attractive                <dbl> 6, 7, 5, 7, 5, 4, 7, 4, ...
## $ i_found_partner__sincere                   <dbl> 9, 8, 8, 6, 6, 9, 6, 9, ...
## $ i_found_partner__intelligent               <dbl> 7, 7, 9, 8, 7, 7, 7, 7, ...
## $ i_found_partner__fun                       <dbl> 7, 8, 8, 7, 7, 4, 4, 6, ...
## $ i_found_partner__ambitious                 <dbl> 6, 5, 5, 6, 6, 6, 6, 5, ...
## $ i_found_partner__interests                 <dbl> 5, 6, 7, 8, 6, 4, 7, 6, ...
## $ partner_found_me__attractive               <dbl> 6, 7, 10, 7, 8, 7, 3, 6,...
## $ partner_found_me__sincere                  <dbl> 8, 8, 10, 8, 7, 7, 6, 7,...
## $ partner_found_me__intelligent              <dbl> 8, 10, 10, 9, 9, 8, 7, 5...
## $ partner_found_me__fun                      <dbl> 8, 7, 10, 8, 6, 8, 5, 6,...
## $ partner_found_me__ambitious                <dbl> 8, 7, 10, 9, 9, 7, 8, 8,...
## $ partner_found_me__interests                <dbl> 6, 5, 10, 8, 7, 7, 7, 6,...
## $ probability_i_find_partner_liked_me        <dbl> 6, 5, NA, 6, 6, 5, 5, 7,...
## $ met_before                                 <lgl> FALSE, TRUE, TRUE, FALSE...
## $ opinion_duration_of_date                   <fct> Too much, Too much, Too ...
## $ opinion_num_dates                          <fct> Too few, Too few, Too fe...
## $ competitors_look_for__attractive           <lgl> NA, NA, NA, NA, NA, NA, ...
## $ competitors_look_for__sincere              <lgl> NA, NA, NA, NA, NA, NA, ...
## $ competitors_look_for__intelligent          <lgl> NA, NA, NA, NA, NA, NA, ...
## $ competitors_look_for__fun                  <lgl> NA, NA, NA, NA, NA, NA, ...
## $ competitors_look_for__ambitious            <lgl> NA, NA, NA, NA, NA, NA, ...
## $ competitors_look_for__shared_interests     <lgl> NA, NA, NA, NA, NA, NA, ...
## $ you_look_for__attractive                   <dbl> 0.15, 0.15, 0.15, 0.15, ...
## $ you_look_for__sincere                      <dbl> 0.20, 0.20, 0.20, 0.20, ...
## $ you_look_for__intelligent                  <dbl> 0.20, 0.20, 0.20, 0.20, ...
## $ you_look_for__fun                          <dbl> 0.15, 0.15, 0.15, 0.15, ...
## $ you_look_for__ambitious                    <dbl> 0.15, 0.15, 0.15, 0.15, ...
## $ you_look_for__shared_interests             <dbl> 0.15, 0.15, 0.15, 0.15, ...
## $ others_perceive_you__attractive            <lgl> NA, NA, NA, NA, NA, NA, ...
## $ others_perceive_you__sincere               <lgl> NA, NA, NA, NA, NA, NA, ...
## $ others_perceive_you__intelligent           <lgl> NA, NA, NA, NA, NA, NA, ...
## $ others_perceive_you__fun                   <lgl> NA, NA, NA, NA, NA, NA, ...
## $ others_perceive_you__ambitious             <lgl> NA, NA, NA, NA, NA, NA, ...
## $ you_perceive_yourself__attractive          <dbl> 6, 6, 6, 6, 6, 6, 6, 6, ...
## $ you_perceive_yourself__sincere             <dbl> 8, 8, 8, 8, 8, 8, 8, 8, ...
## $ you_perceive_yourself__fun                 <dbl> 8, 8, 8, 8, 8, 8, 8, 8, ...
## $ you_perceive_yourself__intelligent         <dbl> 8, 8, 8, 8, 8, 8, 8, 8, ...
## $ you_perceive_yourself__ambitious           <dbl> 7, 7, 7, 7, 7, 7, 7, 7, ...
## $ actual_importance__attractive              <lgl> NA, NA, NA, NA, NA, NA, ...
## $ actual_importance__sincere                 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ actual_importance__intelligent             <lgl> NA, NA, NA, NA, NA, NA, ...
## $ actual_importance__fun                     <lgl> NA, NA, NA, NA, NA, NA, ...
## $ actual_importance__ambitious               <lgl> NA, NA, NA, NA, NA, NA, ...
## $ actual_importance__shared_interests        <lgl> NA, NA, NA, NA, NA, NA, ...
## $ race                                       <fct> Asian, Asian, Asian, Asi...
## $ goal                                       <fct> Meet new people, Meet ne...

Estatísticas sobre o resumo dos dados

A biblioteca skim, com a função skimr(), oferece uma boa forma de ver um resumo com a característica dos dados

skim(dados_speed_date_normalizada)
Data summary
Name dados_speed_date_normaliz…
Number of rows 8378
Number of columns 79
_______________________
Column type frequency:
factor 10
logical 23
numeric 46
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
sex 0 1.00 FALSE 2 Hom: 4194, Mul: 4184
choice 0 1.00 FALSE 2 ext: 6944, lim: 1434
partner_race 73 0.99 FALSE 5 Whi: 4722, Asi: 1978, Lat: 664, Oth: 521
frequency_date 97 0.99 TRUE 7 Sev: 2094, Twi: 2040, Onc: 1528, Nev: 1434
frequency_go_out 2778 0.67 TRUE 2 Twi: 2990, Sev: 2610, Onc: 0, Twi: 0
career_macro 138 0.98 FALSE 17 Aca: 2320, Ban: 2170, Cre: 724, Law: 675
opinion_duration_of_date 915 0.89 FALSE 3 Too: 4227, Jus: 3059, Too: 177
opinion_num_dates 4107 0.51 FALSE 2 Too: 3622, Too: 649
race 63 0.99 FALSE 5 Whi: 4727, Asi: 1982, Lat: 664, Oth: 522
goal 79 0.99 FALSE 6 Fun: 3426, Mee: 3012, Dat: 631, To : 510

Variable type: logical

skim_variable n_missing complete_rate mean count
match 0 1.00 0.16 FAL: 6998, TRU: 1380
position_started 7974 0.05 1.00 TRU: 404
same_race 0 1.00 0.40 FAL: 5062, TRU: 3316
partner_liked_me 0 1.00 0.42 FAL: 4863, TRU: 3515
i_liked_partner 0 1.00 0.42 FAL: 4860, TRU: 3518
met_before 375 0.96 0.04 FAL: 7652, TRU: 351
competitors_look_for__attractive 8378 0.00 NaN :
competitors_look_for__sincere 7997 0.05 0.05 FAL: 363, TRU: 18
competitors_look_for__intelligent 8204 0.02 0.28 FAL: 125, TRU: 49
competitors_look_for__fun 8319 0.01 0.31 FAL: 41, TRU: 18
competitors_look_for__ambitious 7693 0.08 0.18 FAL: 563, TRU: 122
competitors_look_for__shared_interests 8059 0.04 0.15 FAL: 271, TRU: 48
others_perceive_you__attractive 8378 0.00 NaN :
others_perceive_you__sincere 8368 0.00 1.00 TRU: 10
others_perceive_you__intelligent 8378 0.00 NaN :
others_perceive_you__fun 8378 0.00 NaN :
others_perceive_you__ambitious 8363 0.00 1.00 TRU: 15
actual_importance__attractive 8378 0.00 NaN :
actual_importance__sincere 8205 0.02 0.00 FAL: 173
actual_importance__intelligent 8297 0.01 0.00 FAL: 81
actual_importance__fun 8344 0.00 0.00 FAL: 34
actual_importance__ambitious 7842 0.06 0.00 FAL: 536
actual_importance__shared_interests 8165 0.03 0.00 FAL: 213

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
unique_id_number 0 1.00 283.68 158.58 1.00 154.00 281.00 407.00 552.00 ▇▇▇▇▇
id_within_wave 1 1.00 8.96 5.49 1.00 4.00 8.00 13.00 22.00 ▇▆▅▃▂
subject_within_gender 0 1.00 17.33 10.94 1.00 8.00 16.00 26.00 44.00 ▇▇▅▅▂
n_people_met_in_wave 0 1.00 16.87 4.36 5.00 14.00 18.00 20.00 22.00 ▁▃▂▅▇
position_meeting 0 1.00 9.04 5.51 1.00 4.00 8.00 13.00 22.00 ▇▆▅▃▂
order_meeting 0 1.00 8.93 5.48 1.00 4.00 8.00 13.00 22.00 ▇▆▅▃▂
partnet_id_within_wave 0 1.00 8.96 5.49 1.00 4.00 8.00 13.00 22.00 ▇▆▅▃▂
partner_unique_id_number 10 1.00 283.86 158.58 1.00 154.00 281.00 408.00 552.00 ▇▇▇▇▇
interests_correlation 158 0.98 0.20 0.30 -0.83 -0.02 0.21 0.43 0.91 ▁▃▇▇▂
my_age 95 0.99 26.36 3.57 18.00 24.00 26.00 28.00 55.00 ▇▇▁▁▁
partner_age 104 0.99 26.36 3.56 18.00 24.00 26.00 28.00 55.00 ▇▇▁▁▁
partner_stated_pref_time0_attractive 89 0.99 0.22 0.13 0.00 0.15 0.20 0.25 1.00 ▇▃▁▁▁
partner_stated_pref_time0_sincere 89 0.99 0.17 0.07 0.00 0.15 0.18 0.20 0.60 ▃▇▂▁▁
partner_stated_pref_time0_intelligent 89 0.99 0.20 0.07 0.00 0.17 0.20 0.23 0.50 ▂▇▃▁▁
partner_stated_pref_time0_fun 98 0.99 0.17 0.06 0.00 0.15 0.18 0.20 0.50 ▂▇▂▁▁
partner_stated_pref_time0_ambitious 107 0.99 0.11 0.06 0.00 0.05 0.10 0.15 0.36 ▇▇▇▁▁
partner_stated_pref_time0_shared_interests 129 0.98 0.12 0.06 0.00 0.10 0.11 0.16 0.30 ▆▇▇▃▁
importance_same_race 79 0.99 3.78 2.85 0.00 1.00 3.00 6.00 10.00 ▇▃▂▂▂
importance_same_religion 79 0.99 3.65 2.81 1.00 1.00 3.00 6.00 10.00 ▇▃▃▂▁
income_zipcode 4099 0.51 44887.61 17206.92 8607.00 31516.00 43185.00 54303.00 109031.00 ▃▇▅▂▁
happy_expec 101 0.99 5.53 1.73 1.00 5.00 6.00 7.00 10.00 ▁▃▇▃▁
n_expect_like_you 6578 0.21 5.57 4.76 0.00 2.00 4.00 8.00 20.00 ▇▃▂▁▁
i_found_partner__attractive 202 0.98 6.19 1.95 0.00 5.00 6.00 8.00 10.00 ▁▃▇▇▂
i_found_partner__sincere 277 0.97 7.18 1.74 0.00 6.00 7.00 8.00 10.00 ▁▁▃▇▃
i_found_partner__intelligent 296 0.96 7.37 1.55 0.00 6.00 7.00 8.00 10.00 ▁▁▃▇▃
i_found_partner__fun 350 0.96 6.40 1.95 0.00 5.00 7.00 8.00 10.00 ▁▂▇▇▂
i_found_partner__ambitious 712 0.92 6.78 1.79 0.00 6.00 7.00 8.00 10.00 ▁▂▆▇▃
i_found_partner__interests 1067 0.87 5.47 2.16 0.00 4.00 6.00 7.00 10.00 ▂▅▇▆▂
partner_found_me__attractive 212 0.97 6.19 1.95 0.00 5.00 6.00 8.00 10.50 ▁▃▇▇▂
partner_found_me__sincere 287 0.97 7.18 1.74 0.00 6.00 7.00 8.00 10.00 ▁▁▃▇▃
partner_found_me__intelligent 306 0.96 7.37 1.55 0.00 6.00 7.00 8.00 10.00 ▁▁▃▇▃
partner_found_me__fun 360 0.96 6.40 1.95 0.00 5.00 7.00 8.00 11.00 ▁▂▇▇▂
partner_found_me__ambitious 722 0.91 6.78 1.79 0.00 6.00 7.00 8.00 10.00 ▁▂▆▇▃
partner_found_me__interests 1076 0.87 5.47 2.16 0.00 4.00 6.00 7.00 10.00 ▂▅▇▆▂
probability_i_find_partner_liked_me 309 0.96 5.21 2.13 0.00 4.00 5.00 7.00 10.00 ▂▅▇▅▁
you_look_for__attractive 79 0.99 0.23 0.13 0.00 0.15 0.20 0.25 1.00 ▇▃▁▁▁
you_look_for__sincere 79 0.99 0.17 0.07 0.00 0.15 0.18 0.20 0.60 ▃▇▂▁▁
you_look_for__intelligent 79 0.99 0.20 0.07 0.00 0.17 0.20 0.23 0.50 ▂▇▃▁▁
you_look_for__fun 89 0.99 0.17 0.06 0.00 0.15 0.18 0.20 0.50 ▂▇▂▁▁
you_look_for__ambitious 99 0.99 0.11 0.06 0.00 0.05 0.10 0.15 0.36 ▇▇▇▁▁
you_look_for__shared_interests 121 0.99 0.12 0.06 0.00 0.10 0.11 0.16 0.30 ▆▇▇▃▁
you_perceive_yourself__attractive 105 0.99 7.08 1.40 2.00 6.00 7.00 8.00 10.00 ▁▂▂▇▂
you_perceive_yourself__sincere 105 0.99 8.29 1.41 2.00 8.00 8.00 9.00 10.00 ▁▁▁▆▇
you_perceive_yourself__fun 105 0.99 7.70 1.56 2.00 7.00 8.00 9.00 10.00 ▁▁▂▇▆
you_perceive_yourself__intelligent 105 0.99 8.40 1.08 3.00 8.00 8.00 9.00 10.00 ▁▁▃▆▇
you_perceive_yourself__ambitious 105 0.99 7.58 1.78 2.00 7.00 8.00 9.00 10.00 ▁▂▂▇▆

Uma boa forma de ver um resumo dos dados

A função skim() devolve um tibble, que pode ser usado para extrair estatísticas da base

resumo <- skim(dados_speed_date_normalizada)


glimpse(resumo)
## Rows: 79
## Columns: 17
## $ skim_type         <chr> "factor", "factor", "factor", "factor", "factor",...
## $ skim_variable     <chr> "sex", "choice", "partner_race", "frequency_date"...
## $ n_missing         <int> 0, 0, 73, 97, 2778, 138, 915, 4107, 63, 79, 0, 79...
## $ complete_rate     <dbl> 1.000000000, 1.000000000, 0.991286703, 0.98842205...
## $ factor.ordered    <lgl> FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FA...
## $ factor.n_unique   <int> 2, 2, 5, 7, 2, 17, 3, 2, 5, 6, NA, NA, NA, NA, NA...
## $ factor.top_counts <chr> "Hom: 4194, Mul: 4184", "ext: 6944, lim: 1434", "...
## $ logical.mean      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0.1647171...
## $ logical.count     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "FAL: 699...
## $ numeric.mean      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ numeric.sd        <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ numeric.p0        <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ numeric.p25       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ numeric.p50       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ numeric.p75       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ numeric.p100      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ numeric.hist      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...

Podemos ver que temos muito campos quase completos e alguns campos bem menos preenchidos.

De modo geral, são campos que foram preenchidos numa pesquisa feita semanas depois do evento.

ggplot(resumo) +
  geom_density(
    aes(
      x = complete_rate
    ),
    adjust = 0.1
  ) +
  theme_minimal()

Retiramos, então os dados com pouca representação

campos_com_representacao <-  resumo %>% 
  filter(
    complete_rate > 0.75
  )


dados_com_representacao <-  dados_speed_date_normalizada %>% 
  select(
    campos_com_representacao$skim_variable
  )

Adicionando dados do parceiro

Queremos ter algumas impressões do parceiro no nosso conjunto de dados, e assim fazemos o resumo final para começar a brincar com os dados.

dados_speed_date_partner_side <- dados_speed_date_normalizada %>% 
  select(
    unique_id_number, 
    partner_unique_id_number,
    probability_partner_find_i_liked_partner = probability_i_find_partner_liked_me,
    partner_career_macro = career_macro,
    starts_with("you_perceive_yourself__")
  ) %>% 
  rename_with(
    .cols = starts_with("you_perceive_yourself__"),
    .fn = ~str_replace(.x, "you_perceive_yourself__", "partner_perceives_himself__")
  )


dados_finais <- dados_com_representacao %>% 
  left_join(
    dados_speed_date_partner_side,
    by = c("unique_id_number" = "partner_unique_id_number", "partner_unique_id_number" = "unique_id_number"  )
  ) %>% 
  filter(
    across(
      .cols = everything(),
      .fns = ~!is.na(.x)
    )
  ) %>% 
  mutate(
    across(
      .cols = where(is.logical) ,
      .fns = as.numeric
    )
  )
  



resumo_com_representacao <-  skim(dados_finais)

resumo_com_representacao
Data summary
Name dados_finais
Number of rows 4885
Number of columns 64
_______________________
Column type frequency:
factor 9
numeric 55
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
sex 0 1 FALSE 2 Hom: 2456, Mul: 2429
choice 0 1 FALSE 2 ext: 4099, lim: 786
partner_race 0 1 FALSE 5 Whi: 2683, Asi: 1210, Lat: 386, Oth: 357
frequency_date 0 1 TRUE 7 Twi: 1301, Sev: 1242, Onc: 910, Nev: 793
career_macro 0 1 FALSE 17 Aca: 1472, Ban: 1236, Cre: 419, Law: 401
opinion_duration_of_date 0 1 FALSE 3 Too: 2786, Jus: 1996, Too: 103
race 0 1 FALSE 5 Whi: 2687, Asi: 1190, Lat: 406, Oth: 372
goal 0 1 FALSE 6 Fun: 2054, Mee: 1793, Dat: 372, To : 273
partner_career_macro 0 1 FALSE 17 Aca: 1396, Ban: 1292, Cre: 408, Law: 379

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
match 0 1 0.18 0.38 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▂
same_race 0 1 0.39 0.49 0.00 0.00 0.00 1.00 1.00 ▇▁▁▁▅
partner_liked_me 0 1 0.44 0.50 0.00 0.00 0.00 1.00 1.00 ▇▁▁▁▆
i_liked_partner 0 1 0.45 0.50 0.00 0.00 0.00 1.00 1.00 ▇▁▁▁▆
met_before 0 1 0.05 0.22 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
unique_id_number 0 1 283.07 156.88 4.00 160.00 274.00 411.00 552.00 ▇▆▇▆▇
id_within_wave 0 1 9.10 5.57 1.00 4.00 8.00 13.00 22.00 ▇▆▅▃▂
subject_within_gender 0 1 17.61 11.09 1.00 8.00 16.00 26.00 44.00 ▇▇▅▅▂
n_people_met_in_wave 0 1 17.01 4.33 5.00 14.00 18.00 20.00 22.00 ▁▂▂▅▇
position_meeting 0 1 9.13 5.50 1.00 4.00 9.00 13.00 22.00 ▇▆▅▅▂
order_meeting 0 1 8.87 5.46 1.00 4.00 8.00 13.00 22.00 ▇▆▅▃▂
partnet_id_within_wave 0 1 9.12 5.51 1.00 5.00 9.00 13.00 22.00 ▇▆▅▃▂
partner_unique_id_number 0 1 282.91 156.98 4.00 158.00 274.00 411.00 552.00 ▇▆▇▆▇
interests_correlation 0 1 0.20 0.30 -0.83 -0.02 0.22 0.43 0.91 ▁▃▇▇▂
my_age 0 1 26.16 3.44 18.00 24.00 26.00 28.00 55.00 ▇▇▁▁▁
partner_age 0 1 26.19 3.41 18.00 24.00 26.00 28.00 55.00 ▇▇▁▁▁
partner_stated_pref_time0_attractive 0 1 0.22 0.11 0.00 0.15 0.20 0.25 1.00 ▇▃▁▁▁
partner_stated_pref_time0_sincere 0 1 0.18 0.07 0.00 0.15 0.18 0.20 0.47 ▂▇▇▁▁
partner_stated_pref_time0_intelligent 0 1 0.20 0.07 0.00 0.18 0.20 0.24 0.50 ▁▇▃▁▁
partner_stated_pref_time0_fun 0 1 0.17 0.06 0.00 0.15 0.18 0.20 0.50 ▂▇▂▁▁
partner_stated_pref_time0_ambitious 0 1 0.11 0.06 0.00 0.05 0.10 0.15 0.36 ▆▇▇▁▁
partner_stated_pref_time0_shared_interests 0 1 0.12 0.06 0.00 0.10 0.11 0.16 0.30 ▅▇▇▃▁
importance_same_race 0 1 3.83 2.83 1.00 1.00 3.00 6.00 10.00 ▇▃▃▂▂
importance_same_religion 0 1 3.61 2.85 1.00 1.00 3.00 6.00 10.00 ▇▂▃▂▂
happy_expec 0 1 5.49 1.78 1.00 5.00 6.00 7.00 10.00 ▁▃▇▅▁
i_found_partner__attractive 0 1 6.25 1.94 0.00 5.00 6.00 8.00 10.00 ▁▃▇▇▂
i_found_partner__sincere 0 1 7.22 1.72 0.00 6.00 7.00 8.00 10.00 ▁▁▃▇▃
i_found_partner__intelligent 0 1 7.43 1.52 0.00 7.00 8.00 8.00 10.00 ▁▁▃▇▃
i_found_partner__fun 0 1 6.48 1.94 0.00 5.00 7.00 8.00 10.00 ▁▂▇▇▃
i_found_partner__ambitious 0 1 6.82 1.79 0.00 6.00 7.00 8.00 10.00 ▁▁▆▇▃
i_found_partner__interests 0 1 5.55 2.14 0.00 4.00 6.00 7.00 10.00 ▂▅▇▆▂
partner_found_me__attractive 0 1 6.21 1.93 0.00 5.00 6.00 8.00 10.00 ▁▃▇▇▂
partner_found_me__sincere 0 1 7.17 1.74 0.00 6.00 7.00 8.00 10.00 ▁▁▃▇▃
partner_found_me__intelligent 0 1 7.39 1.53 0.00 6.00 7.00 8.00 10.00 ▁▁▃▇▃
partner_found_me__fun 0 1 6.43 1.94 0.00 5.00 7.00 8.00 11.00 ▁▂▇▇▂
partner_found_me__ambitious 0 1 6.76 1.79 0.00 6.00 7.00 8.00 10.00 ▁▂▆▇▃
partner_found_me__interests 0 1 5.50 2.13 0.00 4.00 6.00 7.00 10.00 ▂▅▇▆▂
probability_i_find_partner_liked_me 0 1 5.33 2.12 0.00 4.00 5.00 7.00 10.00 ▂▅▇▅▁
you_look_for__attractive 0 1 0.22 0.11 0.00 0.15 0.20 0.25 0.90 ▇▇▁▁▁
you_look_for__sincere 0 1 0.18 0.07 0.00 0.15 0.18 0.20 0.47 ▁▇▇▁▁
you_look_for__intelligent 0 1 0.20 0.07 0.00 0.18 0.20 0.23 0.50 ▁▇▃▁▁
you_look_for__fun 0 1 0.17 0.06 0.00 0.15 0.18 0.20 0.50 ▂▇▂▁▁
you_look_for__ambitious 0 1 0.11 0.06 0.00 0.05 0.10 0.15 0.36 ▆▇▇▁▁
you_look_for__shared_interests 0 1 0.12 0.06 0.00 0.10 0.11 0.16 0.30 ▆▇▇▃▁
you_perceive_yourself__attractive 0 1 7.14 1.41 2.00 6.00 7.00 8.00 10.00 ▁▂▂▇▂
you_perceive_yourself__sincere 0 1 8.33 1.44 2.00 8.00 9.00 9.00 10.00 ▁▁▁▆▇
you_perceive_yourself__fun 0 1 7.77 1.58 2.00 7.00 8.00 9.00 10.00 ▁▁▂▇▆
you_perceive_yourself__intelligent 0 1 8.50 1.08 3.00 8.00 9.00 9.00 10.00 ▁▁▂▅▇
you_perceive_yourself__ambitious 0 1 7.65 1.83 2.00 7.00 8.00 9.00 10.00 ▁▂▂▇▇
probability_partner_find_i_liked_partner 0 1 5.27 2.12 0.00 4.00 5.00 7.00 10.00 ▂▅▇▅▁
partner_perceives_himself__attractive 0 1 7.11 1.40 2.00 6.00 7.00 8.00 10.00 ▁▂▂▇▂
partner_perceives_himself__sincere 0 1 8.33 1.44 2.00 8.00 9.00 9.00 10.00 ▁▁▁▆▇
partner_perceives_himself__fun 0 1 7.73 1.57 2.00 7.00 8.00 9.00 10.00 ▁▁▂▇▆
partner_perceives_himself__intelligent 0 1 8.46 1.08 3.00 8.00 8.00 9.00 10.00 ▁▁▂▆▇
partner_perceives_himself__ambitious 0 1 7.63 1.82 2.00 7.00 8.00 9.00 10.00 ▁▂▂▇▇

Alguma análise exploratória

Podemos ver, por exemplo, se as pessoas têm uma imagem acurada da própria atratividade

escala_sexo = c(Homem = "darkblue", Mulher = "darkred")


dados_finais %>% 
  ggplot(
    aes(
      y = partner_found_me__attractive,
      x = you_perceive_yourself__attractive
    )
  ) +
  geom_boxplot(
    aes(
      group = you_perceive_yourself__attractive,
      color = sex,
      fill = sex,
      alpha = 0.3
    ),
    show.legend = FALSE
  ) +
  scale_color_manual(
    values = escala_sexo
  ) +
  scale_fill_manual(
    values = escala_sexo
  ) +
  stat_smooth(
    method = "loess",
    formula = y ~ x,
    show.legend = FALSE,
    se = FALSE,
    aes(
      color = sex
    )
    
  ) +
  geom_function(
    fun = identity 
  ) +
  facet_wrap(
    ~sex
  ) +
  scale_x_continuous(
    breaks = 0:10
  ) +
  scale_y_continuous(
    breaks = 0:10
  ) +
  labs(
    x = "Me acho bonito",
    y = "Parceiro me acha bonito"
  ) +
  theme_minimal()

Alguma análise exploratória

Como o quanto eu achei o parceiro bom em algum atributo está correlacionado com o fato de eu gostar do parceiro?

dados_grafico_partner_liked <- dados_finais %>% 
  select(
    i_liked_partner,
    starts_with("i_found_partner__"),
    sex
  ) %>% 
  pivot_longer(
    cols = -c(i_liked_partner, sex),
    names_to = "i_found_partner",
    names_pattern = "i_found_partner__(.*)",
    values_to = "degree"
  ) %>% 
  mutate(
    degree = round(degree)
  ) %>% 
  group_by(
    degree,
    i_found_partner,
    sex
  ) %>% 
  summarise(
    i_liked_partner = mean(i_liked_partner),
    n = n()
  ) %>% 
  filter(
    n > 100
  )

  
ggplot(dados_grafico_partner_liked) +
  geom_line(
    aes(
      x = degree,
      y = i_liked_partner,
      color = sex,
    ),
    size = 1.2

  ) +
  geom_point(
    aes(
      x = degree,
      y = i_liked_partner,
      color = sex,
      size = n
    )
  ) +
  facet_wrap(
    ~i_found_partner
  ) +
  theme_minimal() +
  theme(
    legend.position = "top"
  ) +
  scale_x_continuous(
    breaks = 1:10
  ) +
  scale_y_continuous(
    limits = c(0,1),
    breaks = seq(0, to = 1, by = .2),
    labels = percent_format(accuracy = 1)
  ) +
  scale_color_manual(
    values = escala_sexo
  ) +
  labs(
    x = "Gostei deste atributo no parceiro",
    y = "Gostei do parceiro. Quero ele(a)"
  )

Biblioteca Parnsip

Na análise anterior, fizemos a média condicional variável a variável, mas podemos fazer a média condicional a todas as variáveis ao mesmo tempo.

A forma de fazer isso é rodando uma regressão linear de mínimos quadrados ordinários múltipla.

parnsnip é a sucessora do núcleo da caret.

Ela é usada para oferecer uma interface genérica a alguns tipos de modelos de aprensizado estatístico

No caso, escolhemos um modelo linear e usamos como engine a função lm do R

lm_mod <- 
  linear_reg() %>% 
  set_engine("lm")

lm_mod
## Linear Regression Model Specification (regression)
## 
## Computational engine: lm

Regressão múltipla nos atributos

Agora rodamos efetivamente o modelo

Notem que o modelo é rodado com as interações entre os atributos e a dummy “sex”

A biblioteca yardstick oferece métodos para extrairmos métrica e estimações de dentro dos objetos retornados pelas funções de treinamento da parsnip, como fit()

lm_fit <- 
  lm_mod %>% 
  fit(  i_liked_partner ~ 
        sex +
        i_found_partner__attractive * sex +
        i_found_partner__ambitious * sex +
        i_found_partner__fun * sex +
        i_found_partner__intelligent * sex +
        i_found_partner__interests * sex +
        i_found_partner__sincere * sex ,

  data = dados_finais)


tidy(lm_fit)
## # A tibble: 14 x 5
##    term                                   estimate std.error statistic  p.value
##    <chr>                                     <dbl>     <dbl>     <dbl>    <dbl>
##  1 (Intercept)                           -0.395      0.0442    -8.92   6.56e-19
##  2 sexHomem                               0.0674     0.0643     1.05   2.95e- 1
##  3 i_found_partner__attractive            0.0673     0.00547   12.3    2.77e-34
##  4 i_found_partner__ambitious            -0.0205     0.00633   -3.25   1.18e- 3
##  5 i_found_partner__fun                   0.0365     0.00629    5.81   6.71e- 9
##  6 i_found_partner__intelligent           0.0189     0.00846    2.23   2.59e- 2
##  7 i_found_partner__interests             0.0479     0.00514    9.31   1.98e-20
##  8 i_found_partner__sincere              -0.0164     0.00648   -2.53   1.13e- 2
##  9 sexHomem:i_found_partner__attractive   0.0472     0.00786    6.00   2.07e- 9
## 10 sexHomem:i_found_partner__ambitious   -0.00375    0.00918   -0.409  6.83e- 1
## 11 sexHomem:i_found_partner__fun         -0.000680   0.00921   -0.0738 9.41e- 1
## 12 sexHomem:i_found_partner__intelligent -0.0312     0.0122    -2.56   1.05e- 2
## 13 sexHomem:i_found_partner__interests   -0.00185    0.00732   -0.252  8.01e- 1
## 14 sexHomem:i_found_partner__sincere     -0.00353    0.00973   -0.363  7.17e- 1

Mais fácil ver em forma de gráfico

dwplot(tidy(lm_fit), dot_args = list(size = 2, color = "darkblue"),
         whisker_args = list(color = "darkblue"),
         vline = geom_vline(xintercept = 0, colour = "darkblue", linetype = 2)) +
  theme_minimal()

Estimando a resposta com novos dados

Podemos usar a função predict() para gerar estimativas para valores de y dados novos valores de x

medias_i_found <- dados_finais %>% 
  select(
    starts_with("i_found_partner__"),
    sex
  ) %>% 
  pivot_longer(
    cols = -c(sex),
    names_to = "i_found_partner",
    names_pattern = "i_found_partner__(.*)",
    values_to = "degree"
  ) %>% 
  mutate(
    degree = as.numeric(degree)
  ) %>% 
  group_by(
    sex,
    i_found_partner
  ) %>% 
  summarise(
    p10 = quantile(degree, probs = 0.1, na.rm = TRUE),
    p90 = quantile(degree, probs = 0.9, na.rm = TRUE),
    p25 = quantile(degree, probs = 0.25, na.rm = TRUE),
    p75 = quantile(degree, probs = 0.75, na.rm = TRUE),
    p33 = quantile(degree, probs = 0.33, na.rm = TRUE),
    p67 = quantile(degree, probs = 0.67, na.rm = TRUE),
    mean = mean(degree, na.rm = TRUE)
  ) %>% 
  pivot_wider(
    names_from = i_found_partner,
    values_from = c(mean, p10, p90, p25, p75, p33, p67)
  ) 
  

med_h <- medias_i_found %>% 
  filter(
    sex == "Homem"
  )
  
med_m <- medias_i_found %>% 
  filter(
    sex == "Mulher"
  )

pontos_novos <- 
  tribble(
    ~attractive,           ~ambitious,     ~fun,      ~intelligent,      ~interests,     ~sincere,       ~sex,    ~nome,
    med_h$mean_attractive, med_h$mean_ambitious, med_h$mean_fun, med_h$mean_intelligent, med_h$mean_interests, med_h$mean_sincere, "Homem", "Média",
    med_h$p10_attractive,  med_h$mean_ambitious, med_h$mean_fun, med_h$mean_intelligent, med_h$mean_interests, med_h$mean_sincere, "Homem", "P10",
    med_h$p25_attractive,  med_h$mean_ambitious, med_h$mean_fun, med_h$mean_intelligent, med_h$mean_interests, med_h$mean_sincere, "Homem", "P25",
    med_h$p90_attractive, med_h$mean_ambitious, med_h$mean_fun, med_h$mean_intelligent, med_h$mean_interests, med_h$mean_sincere, "Homem",  "P90",
    med_h$p75_attractive,  med_h$mean_ambitious, med_h$mean_fun, med_h$mean_intelligent, med_h$mean_interests, med_h$mean_sincere, "Homem", "P75",
    med_h$p33_attractive,  med_h$mean_ambitious, med_h$mean_fun, med_h$mean_intelligent, med_h$mean_interests, med_h$mean_sincere, "Homem", "P33",
    med_h$p67_attractive,  med_h$mean_ambitious, med_h$mean_fun, med_h$mean_intelligent, med_h$mean_interests, med_h$mean_sincere, "Homem", "P67",
    med_m$mean_attractive, med_m$mean_ambitious, med_m$mean_fun, med_m$mean_intelligent, med_m$mean_interests, med_m$mean_sincere, "Mulher", "Média",
    med_m$p10_attractive, med_m$mean_ambitious, med_m$mean_fun, med_m$mean_intelligent, med_m$mean_interests, med_m$mean_sincere, "Mulher", "P10",
    med_m$p90_attractive, med_m$mean_ambitious, med_m$mean_fun, med_m$mean_intelligent, med_m$mean_interests, med_m$mean_sincere, "Mulher", "P90",
    med_m$p25_attractive, med_m$mean_ambitious, med_m$mean_fun, med_m$mean_intelligent, med_m$mean_interests, med_m$mean_sincere, "Mulher", "P25",
    med_m$p75_attractive, med_m$mean_ambitious, med_m$mean_fun, med_m$mean_intelligent, med_m$mean_interests, med_m$mean_sincere, "Mulher", "P75",
    med_m$p33_attractive, med_m$mean_ambitious, med_m$mean_fun, med_m$mean_intelligent, med_m$mean_interests, med_m$mean_sincere, "Mulher", "P33",
    med_m$p67_attractive, med_m$mean_ambitious, med_m$mean_fun, med_m$mean_intelligent, med_m$mean_interests, med_m$mean_sincere, "Mulher", "P67" 
  ) %>% 
  rename_with(
    .cols = -c(sex, nome),
    .fn = ~str_glue("i_found_partner__{.x}")
  ) 
  



conf_int_pred <- predict(lm_fit, 
                         new_data = pontos_novos, 
                         type = "conf_int")

mean_pred <- predict(lm_fit, 
                         new_data = pontos_novos
                         )


dados_pred <- pontos_novos %>% 
  bind_cols(
    conf_int_pred
  ) %>% 
  bind_cols(
    mean_pred
  ) %>% 
  view()


ggplot(dados_pred, aes(x = i_found_partner__attractive)) + 
  geom_point(aes(y = .pred, color = sex)) + 
  geom_errorbar(aes(ymin = .pred_lower, 
                    ymax = .pred_upper, color = sex),
                width = .2) + 
  labs(y = "Prob. I like partner")+
  # geom_mark_circle(
  #   aes(
  #     y = .pred,
  #     label = nome,
  #     group = interaction(sex, nome),
  #     color = sex,
  #     fill = sex
  #   ),
  #   label.fontsize = 7,
  #   con.cap = 1,
  #   expand = 0.001,
  #   label.buffer = unit(1, 'mm'),
  #   show.legend = FALSE
  # ) +
  theme_minimal() +
  theme(
    legend.position = "top"
  ) +
  geom_line(
    aes(
      color = sex,
      y = .pred
    )
  ) +
  scale_color_manual(
    values = escala_sexo
  ) +
  scale_x_continuous(
    breaks = 1:10
  ) +
  scale_y_continuous(
    breaks = seq(0, to = 1, by= 0.2),
    limits = c(0,1),
    label = percent_format(accuracy = 1)
  )

Rodando um modelo mais complexo

Agora vamos sair do modelo linear e rodar uma rede neural

dados_finais_nao_nulos_sex_numerico <- dados_finais %>% 
  mutate(
    sex = if_else(sex == "Homem", 1, 0) ,
    i_liked_partner = as.numeric(i_liked_partner),
  ) %>% 
  filter(
    across(
      .cols = everything(),
      .fns = ~!is.na(.x)
    )
  )

pontos_novos_rand_for <- pontos_novos %>% 
  mutate(
    sex = if_else(sex == "Homem", 1, 0) 
  ) 
  

set.seed(192)

modelo_nnet <- mlp(mode = "regression", hidden_units = 10 ) %>%
  set_engine("nnet") 
  
modelo_nnet
## Single Layer Neural Network Specification (regression)
## 
## Main Arguments:
##   hidden_units = 10
## 
## Computational engine: nnet
fit_nnet <- modelo_nnet %>% fit(  i_liked_partner ~ 
        i_found_partner__attractive +
        i_found_partner__ambitious +
        i_found_partner__fun +
        i_found_partner__intelligent +
        i_found_partner__interests +
        i_found_partner__sincere +
        sex,
        

  data =  dados_finais_nao_nulos_sex_numerico)

fit_nnet
## parsnip model object
## 
## Fit time:  940ms 
## a 7-10-1 network with 91 weights
## inputs: i_found_partner__attractive i_found_partner__ambitious i_found_partner__fun i_found_partner__intelligent i_found_partner__interests i_found_partner__sincere sex 
## output(s): i_liked_partner 
## options were - linear output units

Relação entre variável de entrada e saída pela rede neural

No caso da rede neural, as relações não precisam ser lineares. É o caso aqui

mean_pred <- predict(fit_nnet, 
                         new_data = pontos_novos_rand_for 
                         )





dados_pred_nnet <- pontos_novos %>% 
  bind_cols(
    mean_pred
  )


ggplot(dados_pred_nnet, aes(x = i_found_partner__attractive)) + 
  geom_point(aes(y = .pred, color = sex)) + 
  labs(y = "urchin size")+
  geom_mark_circle(
    aes(
      y = .pred,
      label = nome,
      group = interaction(nome, sex),
      color = sex,
      fill = sex
    ),
    label.fontsize = 8,
    con.cap = 1,
    expand = 0.001,
    label.buffer = unit(3.5, 'mm'),
    show.legend = FALSE
  ) +
  theme_minimal() +
  theme(
    legend.position = "top"
  ) +
  geom_line(
    aes(
      color = sex,
      y = .pred
    )
  ) +
  scale_color_manual(
    values = escala_sexo
  ) +
  scale_y_continuous(
    breaks = seq(0, to = 1, by= 0.2),
    limits = c(0,1)
  ) +
  labs(y = "Prob. I like partner")

Análise exploratória pode (e deve) ser muito mais profunda

É importante fazer uma sessão de exploração, que pode ser muito mais detalhada do que a que fizemos.

A sessão de exploração nos ajuda fazer alguns testes de sanidade nos dados e a extrair alguns insights que podem ou não ser usados para construir o processo Feature Engineering que pode ajudar o modelo atingir melhores resultados.

O processo de feature engineering é o lugar onde mais podemos melhorar o tipo de modelo que vamos usar na maioria das vezes.

A dependência desse processo é menor quando usamos modelos muito complexos, de deep learning, mas para isso é necessário ter uma quantidade colossal de dados.

Escolha e avaliação do modelo: antes separar os dados de teste

Tudo o que fizermos durante o processo de seleção do modelo, como já vimos, deve ser feito nos dados de treinamento (que tambem servirão como validação).

Após a escolha de UM modelo, vamos avaliá-lo nos dados de teste.

Fonte: Feature Engineering and Selection: A Practical Approach for Predictive Models (Kuhn e Johnson)

A biblioteca rsamples oferece a infraestrutura necessária para retirar amostras dos dados disponíveis.

Usamos ela aqui para isolar os dados de teste.

Ela será usada novamente para criar as amostras usadas no cross-validation.

set.seed() é usada para manter a reprodutibilidade. Com a mesma semente, garantimos que a cada execução do script a mesma sequência (pseudo)aleatória será gerada.

O parâmetro strata garante que o balanceamento de um dos atributos (no caso o que usaremos como saída) será mantido nas duas partições.

dados_classificacao <- dados_finais %>% 
  mutate(
    i_liked_partner = if_else(i_liked_partner == 1, "Liked", "Not") %>% factor(levels = c("Liked","Not"))
  )  
  

set.seed(123)
# Put 3/4 of the data into the training set 
split_dado <- initial_split(
  data = dados_classificacao, 
  strata = i_liked_partner,
  prop = 3/4
)


# Create data frames for the two sets:
dado_treino <- training(split_dado)
dado_teste  <- testing(split_dado)
skim(dado_treino)
Data summary
Name dado_treino
Number of rows 3664
Number of columns 64
_______________________
Column type frequency:
factor 10
numeric 54
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
sex 0 1 FALSE 2 Hom: 1845, Mul: 1819
choice 0 1 FALSE 2 ext: 3076, lim: 588
partner_race 0 1 FALSE 5 Whi: 2015, Asi: 919, Lat: 270, Oth: 267
frequency_date 0 1 TRUE 7 Twi: 968, Sev: 920, Onc: 706, Nev: 602
career_macro 0 1 FALSE 17 Aca: 1088, Ban: 932, Cre: 320, Law: 285
opinion_duration_of_date 0 1 FALSE 3 Too: 2116, Jus: 1473, Too: 75
race 0 1 FALSE 5 Whi: 2049, Asi: 895, Lat: 302, Oth: 258
goal 0 1 FALSE 6 Fun: 1528, Mee: 1351, Dat: 281, To : 215
i_liked_partner 0 1 FALSE 2 Not: 2017, Lik: 1647
partner_career_macro 0 1 FALSE 17 Aca: 1055, Ban: 970, Cre: 303, Law: 283

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
match 0 1 0.18 0.39 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▂
same_race 0 1 0.40 0.49 0.00 0.00 0.00 1.00 1.00 ▇▁▁▁▅
partner_liked_me 0 1 0.43 0.50 0.00 0.00 0.00 1.00 1.00 ▇▁▁▁▆
met_before 0 1 0.05 0.22 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
unique_id_number 0 1 283.23 157.01 4.00 158.00 274.00 411.00 552.00 ▇▆▇▆▇
id_within_wave 0 1 9.14 5.60 1.00 4.00 8.00 14.00 22.00 ▇▆▅▃▂
subject_within_gender 0 1 17.69 11.17 1.00 8.00 16.00 27.00 44.00 ▇▇▅▅▂
n_people_met_in_wave 0 1 17.02 4.35 5.00 14.00 18.00 20.00 22.00 ▁▂▂▃▇
position_meeting 0 1 9.09 5.48 1.00 4.00 8.00 13.00 22.00 ▇▆▅▅▂
order_meeting 0 1 8.85 5.45 1.00 4.00 8.00 13.00 22.00 ▇▆▅▃▂
partnet_id_within_wave 0 1 9.23 5.49 1.00 5.00 9.00 14.00 22.00 ▇▆▅▅▂
partner_unique_id_number 0 1 283.12 156.95 4.00 158.00 274.00 411.00 552.00 ▇▆▇▆▇
interests_correlation 0 1 0.19 0.30 -0.83 -0.02 0.21 0.43 0.91 ▁▃▇▇▂
my_age 0 1 26.14 3.45 18.00 24.00 26.00 28.00 55.00 ▇▇▁▁▁
partner_age 0 1 26.19 3.35 18.00 24.00 26.00 28.00 55.00 ▇▇▁▁▁
partner_stated_pref_time0_attractive 0 1 0.22 0.11 0.00 0.15 0.20 0.25 1.00 ▇▃▁▁▁
partner_stated_pref_time0_sincere 0 1 0.18 0.07 0.00 0.15 0.18 0.20 0.47 ▁▇▇▁▁
partner_stated_pref_time0_intelligent 0 1 0.21 0.07 0.00 0.18 0.20 0.25 0.50 ▁▇▃▁▁
partner_stated_pref_time0_fun 0 1 0.17 0.06 0.00 0.15 0.18 0.20 0.50 ▂▇▂▁▁
partner_stated_pref_time0_ambitious 0 1 0.11 0.06 0.00 0.05 0.10 0.15 0.36 ▆▇▇▁▁
partner_stated_pref_time0_shared_interests 0 1 0.12 0.06 0.00 0.10 0.11 0.16 0.30 ▅▇▇▃▁
importance_same_race 0 1 3.87 2.85 1.00 1.00 3.00 6.00 10.00 ▇▃▃▂▂
importance_same_religion 0 1 3.60 2.83 1.00 1.00 3.00 6.00 10.00 ▇▂▃▂▁
happy_expec 0 1 5.52 1.77 1.00 5.00 6.00 7.00 10.00 ▁▃▇▅▁
i_found_partner__attractive 0 1 6.23 1.93 0.00 5.00 6.00 8.00 10.00 ▁▃▇▇▂
i_found_partner__sincere 0 1 7.19 1.73 0.00 6.00 7.00 8.00 10.00 ▁▁▃▇▃
i_found_partner__intelligent 0 1 7.42 1.51 0.00 7.00 7.00 8.00 10.00 ▁▁▃▇▃
i_found_partner__fun 0 1 6.46 1.94 0.00 5.00 7.00 8.00 10.00 ▁▂▇▇▃
i_found_partner__ambitious 0 1 6.82 1.79 0.00 6.00 7.00 8.00 10.00 ▁▁▆▇▃
i_found_partner__interests 0 1 5.54 2.14 0.00 4.00 6.00 7.00 10.00 ▂▅▇▆▂
partner_found_me__attractive 0 1 6.20 1.95 0.00 5.00 6.00 8.00 10.00 ▁▃▇▇▂
partner_found_me__sincere 0 1 7.18 1.73 0.00 6.00 7.00 8.00 10.00 ▁▁▃▇▃
partner_found_me__intelligent 0 1 7.39 1.54 0.00 6.00 7.00 8.00 10.00 ▁▁▃▇▃
partner_found_me__fun 0 1 6.42 1.93 0.00 5.00 7.00 8.00 11.00 ▁▂▇▇▂
partner_found_me__ambitious 0 1 6.75 1.80 0.00 6.00 7.00 8.00 10.00 ▁▂▆▇▃
partner_found_me__interests 0 1 5.51 2.14 0.00 4.00 6.00 7.00 10.00 ▂▅▇▆▂
probability_i_find_partner_liked_me 0 1 5.33 2.12 0.00 4.00 5.00 7.00 10.00 ▂▅▇▅▁
you_look_for__attractive 0 1 0.22 0.11 0.00 0.15 0.20 0.25 0.90 ▇▇▁▁▁
you_look_for__sincere 0 1 0.18 0.07 0.00 0.15 0.18 0.20 0.47 ▁▇▇▁▁
you_look_for__intelligent 0 1 0.20 0.07 0.00 0.18 0.20 0.24 0.50 ▁▇▃▁▁
you_look_for__fun 0 1 0.17 0.06 0.00 0.15 0.18 0.20 0.50 ▂▇▂▁▁
you_look_for__ambitious 0 1 0.11 0.06 0.00 0.05 0.10 0.15 0.36 ▆▇▇▁▁
you_look_for__shared_interests 0 1 0.12 0.06 0.00 0.10 0.11 0.16 0.30 ▅▇▇▃▁
you_perceive_yourself__attractive 0 1 7.13 1.42 2.00 6.00 7.00 8.00 10.00 ▁▂▂▇▂
you_perceive_yourself__sincere 0 1 8.33 1.43 2.00 8.00 9.00 9.00 10.00 ▁▁▁▆▇
you_perceive_yourself__fun 0 1 7.75 1.58 2.00 7.00 8.00 9.00 10.00 ▁▁▂▇▆
you_perceive_yourself__intelligent 0 1 8.50 1.07 3.00 8.00 9.00 9.00 10.00 ▁▁▂▅▇
you_perceive_yourself__ambitious 0 1 7.64 1.84 2.00 7.00 8.00 9.00 10.00 ▁▂▂▇▇
probability_partner_find_i_liked_partner 0 1 5.27 2.14 0.00 4.00 5.00 7.00 10.00 ▂▅▇▅▁
partner_perceives_himself__attractive 0 1 7.10 1.40 2.00 6.00 7.00 8.00 10.00 ▁▂▂▇▂
partner_perceives_himself__sincere 0 1 8.35 1.42 2.00 8.00 9.00 9.00 10.00 ▁▁▁▆▇
partner_perceives_himself__fun 0 1 7.72 1.57 2.00 7.00 8.00 9.00 10.00 ▁▁▂▇▆
partner_perceives_himself__intelligent 0 1 8.46 1.09 3.00 8.00 8.00 9.00 10.00 ▁▁▂▆▇
partner_perceives_himself__ambitious 0 1 7.65 1.80 2.00 7.00 8.00 9.00 10.00 ▁▂▂▇▆
skim(dado_teste)
Data summary
Name dado_teste
Number of rows 1221
Number of columns 64
_______________________
Column type frequency:
factor 10
numeric 54
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
sex 0 1 FALSE 2 Hom: 611, Mul: 610
choice 0 1 FALSE 2 ext: 1023, lim: 198
partner_race 0 1 FALSE 5 Whi: 668, Asi: 291, Lat: 116, Oth: 90
frequency_date 0 1 TRUE 7 Twi: 333, Sev: 322, Onc: 204, Nev: 191
career_macro 0 1 FALSE 16 Aca: 384, Ban: 304, Law: 116, Cre: 99
opinion_duration_of_date 0 1 FALSE 3 Too: 670, Jus: 523, Too: 28
race 0 1 FALSE 5 Whi: 638, Asi: 295, Oth: 114, Lat: 104
goal 0 1 FALSE 6 Fun: 526, Mee: 442, Dat: 91, Oth: 58
i_liked_partner 0 1 FALSE 2 Not: 672, Lik: 549
partner_career_macro 0 1 FALSE 17 Aca: 341, Ban: 322, Cre: 105, Law: 96

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
match 0 1 0.17 0.38 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▂
same_race 0 1 0.37 0.48 0.00 0.00 0.00 1.00 1.00 ▇▁▁▁▅
partner_liked_me 0 1 0.44 0.50 0.00 0.00 0.00 1.00 1.00 ▇▁▁▁▆
met_before 0 1 0.05 0.22 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
unique_id_number 0 1 282.60 156.57 4.00 160.00 274.00 408.00 552.00 ▆▆▇▆▇
id_within_wave 0 1 8.98 5.46 1.00 4.00 8.00 13.00 22.00 ▇▆▅▃▂
subject_within_gender 0 1 17.37 10.87 1.00 8.00 16.00 26.00 44.00 ▇▇▆▅▂
n_people_met_in_wave 0 1 16.97 4.27 5.00 15.00 18.00 20.00 22.00 ▁▃▂▅▇
position_meeting 0 1 9.23 5.53 1.00 4.00 9.00 14.00 22.00 ▇▆▅▅▂
order_meeting 0 1 8.95 5.47 1.00 4.00 8.00 13.00 22.00 ▇▆▅▅▂
partnet_id_within_wave 0 1 8.81 5.57 1.00 4.00 8.00 13.00 22.00 ▇▆▅▃▂
partner_unique_id_number 0 1 282.29 157.14 4.00 156.00 274.00 411.00 552.00 ▇▆▇▆▇
interests_correlation 0 1 0.21 0.31 -0.83 -0.01 0.23 0.44 0.91 ▁▃▇▇▂
my_age 0 1 26.19 3.39 18.00 24.00 26.00 28.00 42.00 ▂▇▅▁▁
partner_age 0 1 26.18 3.58 18.00 23.00 26.00 28.00 55.00 ▇▇▁▁▁
partner_stated_pref_time0_attractive 0 1 0.23 0.12 0.00 0.15 0.20 0.25 0.90 ▆▇▂▁▁
partner_stated_pref_time0_sincere 0 1 0.17 0.07 0.00 0.15 0.18 0.20 0.47 ▂▇▇▁▁
partner_stated_pref_time0_intelligent 0 1 0.20 0.07 0.00 0.17 0.20 0.23 0.50 ▂▇▃▁▁
partner_stated_pref_time0_fun 0 1 0.18 0.06 0.00 0.15 0.18 0.20 0.50 ▂▇▂▁▁
partner_stated_pref_time0_ambitious 0 1 0.11 0.06 0.00 0.05 0.10 0.15 0.36 ▇▇▇▁▁
partner_stated_pref_time0_shared_interests 0 1 0.12 0.06 0.00 0.08 0.11 0.16 0.30 ▆▇▇▅▁
importance_same_race 0 1 3.68 2.79 1.00 1.00 3.00 6.00 10.00 ▇▃▂▂▁
importance_same_religion 0 1 3.66 2.90 1.00 1.00 3.00 6.00 10.00 ▇▃▂▂▂
happy_expec 0 1 5.43 1.82 1.00 4.00 6.00 7.00 10.00 ▁▃▇▅▁
i_found_partner__attractive 0 1 6.29 1.97 0.00 5.00 6.00 8.00 10.00 ▁▃▇▇▃
i_found_partner__sincere 0 1 7.29 1.71 0.00 6.00 7.00 8.00 10.00 ▁▁▃▇▃
i_found_partner__intelligent 0 1 7.46 1.55 0.00 7.00 8.00 8.00 10.00 ▁▁▃▇▃
i_found_partner__fun 0 1 6.54 1.94 0.00 5.00 7.00 8.00 10.00 ▁▂▆▇▃
i_found_partner__ambitious 0 1 6.83 1.79 0.00 6.00 7.00 8.00 10.00 ▁▁▆▇▃
i_found_partner__interests 0 1 5.57 2.15 0.00 4.00 6.00 7.00 10.00 ▂▅▇▆▂
partner_found_me__attractive 0 1 6.23 1.89 1.00 5.00 6.00 8.00 10.00 ▁▃▇▇▂
partner_found_me__sincere 0 1 7.13 1.74 1.00 6.00 7.00 8.00 10.00 ▁▁▃▇▃
partner_found_me__intelligent 0 1 7.36 1.50 1.00 6.00 7.00 8.00 10.00 ▁▁▃▇▃
partner_found_me__fun 0 1 6.45 1.97 0.00 5.00 7.00 8.00 10.00 ▁▂▇▇▃
partner_found_me__ambitious 0 1 6.80 1.76 1.00 6.00 7.00 8.00 10.00 ▁▂▆▇▃
partner_found_me__interests 0 1 5.48 2.11 0.00 4.00 6.00 7.00 10.00 ▂▅▇▅▂
probability_i_find_partner_liked_me 0 1 5.36 2.12 0.00 4.00 5.00 7.00 10.00 ▂▃▇▅▁
you_look_for__attractive 0 1 0.22 0.11 0.00 0.15 0.20 0.25 0.90 ▇▇▁▁▁
you_look_for__sincere 0 1 0.18 0.07 0.00 0.15 0.19 0.20 0.47 ▁▇▇▁▁
you_look_for__intelligent 0 1 0.20 0.06 0.00 0.18 0.20 0.23 0.50 ▁▇▃▁▁
you_look_for__fun 0 1 0.17 0.06 0.00 0.15 0.18 0.20 0.50 ▂▇▂▁▁
you_look_for__ambitious 0 1 0.11 0.06 0.00 0.05 0.10 0.15 0.36 ▆▇▇▁▁
you_look_for__shared_interests 0 1 0.12 0.06 0.00 0.08 0.11 0.16 0.30 ▆▇▆▅▁
you_perceive_yourself__attractive 0 1 7.18 1.38 2.00 7.00 7.00 8.00 10.00 ▁▂▂▇▂
you_perceive_yourself__sincere 0 1 8.33 1.45 2.00 8.00 9.00 9.00 10.00 ▁▁▁▆▇
you_perceive_yourself__fun 0 1 7.83 1.58 2.00 7.00 8.00 9.00 10.00 ▁▁▂▇▇
you_perceive_yourself__intelligent 0 1 8.49 1.08 3.00 8.00 9.00 9.00 10.00 ▁▁▂▆▇
you_perceive_yourself__ambitious 0 1 7.68 1.80 2.00 7.00 8.00 9.00 10.00 ▁▂▂▇▇
probability_partner_find_i_liked_partner 0 1 5.27 2.05 0.00 4.00 5.00 7.00 10.00 ▂▃▇▅▁
partner_perceives_himself__attractive 0 1 7.15 1.39 2.00 6.00 7.00 8.00 10.00 ▁▁▂▇▂
partner_perceives_himself__sincere 0 1 8.27 1.49 2.00 7.00 9.00 9.00 10.00 ▁▁▁▆▇
partner_perceives_himself__fun 0 1 7.75 1.57 2.00 7.00 8.00 9.00 10.00 ▁▁▂▇▆
partner_perceives_himself__intelligent 0 1 8.46 1.06 4.00 8.00 9.00 9.00 10.00 ▁▁▂▅▇
partner_perceives_himself__ambitious 0 1 7.59 1.89 2.00 7.00 8.00 9.00 10.00 ▁▂▂▇▇

Especificação e pré-processamento, pra qualquer modelo

Conforme comentamos, as bibliotecas da tidymodels são ortogonais.

A biblioteca recipes serve a um fim específico: definir os passos do pré-processamento dos dados. Esses passos podem ser definidos de forma independente da definição do modelo a ser usado, da forma de cross-validation, da medição da performance etc.

Um dos passos que podem ser definidos na recipes é a identificação de atributos que não têm papel preditivo e, portanto, não devem ser usados no treinamento e na predição, mas que queremos manter no nosso tibble para identificação das linhas.

receita <- recipe(i_liked_partner ~ ., data = dado_treino) %>%
  update_role(
    match,
    unique_id_number,
    id_within_wave,
    subject_within_gender,
    partnet_id_within_wave,
    partner_unique_id_number,
    new_role = "ID"
  )

summary(receita)
## # A tibble: 64 x 4
##    variable                 type    role      source  
##    <chr>                    <chr>   <chr>     <chr>   
##  1 sex                      nominal predictor original
##  2 choice                   nominal predictor original
##  3 partner_race             nominal predictor original
##  4 frequency_date           nominal predictor original
##  5 career_macro             nominal predictor original
##  6 opinion_duration_of_date nominal predictor original
##  7 race                     nominal predictor original
##  8 goal                     nominal predictor original
##  9 match                    numeric ID        original
## 10 same_race                numeric predictor original
## # ... with 54 more rows

Criando dummies

Alguns engines de modelos não trabalham bem com fatores.

Quem está acostumado com a lm, sabe que os fatores são transformados automaticamente em dummies, mas isso não acontece com todos os engines.

step_dummy() faz esse trabalho, ou seja, cria uma variável binária pra cada level do fator (menos um). Veja como podemos usar o seletor all_nominal() e o all_outcomes()

step_zv() retira as variáveis com variância zero. Isso vai acontecer bastante quando temos levels de fatores infrequentes.

receita <- recipe(i_liked_partner ~ ., data = dado_treino) %>%
  update_role(
    match,
    unique_id_number,
    id_within_wave,
    subject_within_gender,
    partnet_id_within_wave,
    partner_unique_id_number,
    new_role = "ID"
  ) %>% 
  step_dummy(
    all_nominal(), -all_outcomes() 
  ) %>% 
  step_zv(all_predictors()) 
  


summary(receita) %>% 
  gt()
variable type role source
sex nominal predictor original
choice nominal predictor original
partner_race nominal predictor original
frequency_date nominal predictor original
career_macro nominal predictor original
opinion_duration_of_date nominal predictor original
race nominal predictor original
goal nominal predictor original
match numeric ID original
same_race numeric predictor original
partner_liked_me numeric predictor original
met_before numeric predictor original
unique_id_number numeric ID original
id_within_wave numeric ID original
subject_within_gender numeric ID original
n_people_met_in_wave numeric predictor original
position_meeting numeric predictor original
order_meeting numeric predictor original
partnet_id_within_wave numeric ID original
partner_unique_id_number numeric ID original
interests_correlation numeric predictor original
my_age numeric predictor original
partner_age numeric predictor original
partner_stated_pref_time0_attractive numeric predictor original
partner_stated_pref_time0_sincere numeric predictor original
partner_stated_pref_time0_intelligent numeric predictor original
partner_stated_pref_time0_fun numeric predictor original
partner_stated_pref_time0_ambitious numeric predictor original
partner_stated_pref_time0_shared_interests numeric predictor original
importance_same_race numeric predictor original
importance_same_religion numeric predictor original
happy_expec numeric predictor original
i_found_partner__attractive numeric predictor original
i_found_partner__sincere numeric predictor original
i_found_partner__intelligent numeric predictor original
i_found_partner__fun numeric predictor original
i_found_partner__ambitious numeric predictor original
i_found_partner__interests numeric predictor original
partner_found_me__attractive numeric predictor original
partner_found_me__sincere numeric predictor original
partner_found_me__intelligent numeric predictor original
partner_found_me__fun numeric predictor original
partner_found_me__ambitious numeric predictor original
partner_found_me__interests numeric predictor original
probability_i_find_partner_liked_me numeric predictor original
you_look_for__attractive numeric predictor original
you_look_for__sincere numeric predictor original
you_look_for__intelligent numeric predictor original
you_look_for__fun numeric predictor original
you_look_for__ambitious numeric predictor original
you_look_for__shared_interests numeric predictor original
you_perceive_yourself__attractive numeric predictor original
you_perceive_yourself__sincere numeric predictor original
you_perceive_yourself__fun numeric predictor original
you_perceive_yourself__intelligent numeric predictor original
you_perceive_yourself__ambitious numeric predictor original
probability_partner_find_i_liked_partner numeric predictor original
partner_career_macro nominal predictor original
partner_perceives_himself__attractive numeric predictor original
partner_perceives_himself__sincere numeric predictor original
partner_perceives_himself__fun numeric predictor original
partner_perceives_himself__intelligent numeric predictor original
partner_perceives_himself__ambitious numeric predictor original
i_liked_partner nominal outcome original

Criando variáveis ordinais

Alguns fatores são ordinais. Pode ser uma boa ideia codificá-los em uma só variável numérica, que vai manter a ordem natural dos levels.

No nosso exemplo, a variável que representa com qual frequência as pessoas saem à noite apresenta níveis que podem ser ordenados.

receita <- recipe(i_liked_partner ~ ., data = dado_treino) %>%
  update_role(
    match,
    unique_id_number,
    id_within_wave,
    subject_within_gender,
    partnet_id_within_wave,
    partner_unique_id_number,
    new_role = "ID"
  ) %>% 
  step_ordinalscore(
    frequency_date
  ) %>% 
  step_dummy(
    all_nominal(), -frequency_date, -all_outcomes()
  ) %>%
  step_zv(all_predictors()) 

Criando um workflow para estimar um modelo

Sabendo criar uma receita de pré-processamento e relembrando como criar uma interface genérica para um modelo com parsnip() e como selecionar um engine pra ele, podemos criar um pequeno fluxo de trabalho para realizar esse processamento, usando a biblioteca workflows()

lr_mod <- 
  logistic_reg() %>% 
  set_engine("glm") 

wf <- workflow() %>% 
  add_recipe(receita) %>% 
  add_model(lr_mod) 

wf
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: logistic_reg()
## 
## -- Preprocessor ----------------------------------------------------------------
## 3 Recipe Steps
## 
## * step_ordinalscore()
## * step_dummy()
## * step_zv()
## 
## -- Model -----------------------------------------------------------------------
## Logistic Regression Model Specification (classification)
## 
## Computational engine: glm

Estimando o modelo com uso do workflow

Com o workflow criado é possível estimá-lo usando a função fit()

fit_teste <- 
  wf %>% 
  fit(
    data = dado_treino
  )

fit_teste %>%  tidy() %>% 
  select(
    term,
    estimate,
    p.value
  ) %>% 
  arrange(
    p.value
  ) %>% 
  gt() %>% 
  fmt_number(
    columns = one_of("estimate"),
    decimals = 3
  ) %>% 
  fmt_number(
    columns = one_of("p.value"),
    decimals = 2
  ) 
term estimate p.value
i_found_partner__attractive −0.573 0.00
probability_i_find_partner_liked_me −0.308 0.00
i_found_partner__interests −0.222 0.00
i_found_partner__fun −0.279 0.00
you_look_for__attractive 5.085 0.00
you_perceive_yourself__fun 0.171 0.00
race_White 0.620 0.00
sex_Homem −0.536 0.00
partner_found_me__attractive 0.120 0.00
career_macro_Politics 2.790 0.00
you_perceive_yourself__sincere 0.133 0.00
same_race −0.376 0.00
importance_same_race 0.063 0.00
i_found_partner__ambitious 0.123 0.00
goal_Serious −0.984 0.00
career_macro_Pro.sports.Athletics −4.469 0.00
probability_partner_find_i_liked_partner −0.083 0.00
you_look_for__ambitious 4.241 0.00
partner_career_macro_Speech.Pathology 2.574 0.00
n_people_met_in_wave −0.054 0.00
i_found_partner__sincere 0.110 0.01
career_macro_Lawyer 0.515 0.01
partner_career_macro_Psychologist 0.729 0.01
partner_career_macro_Creative.Arts.Entertainment 0.443 0.01
career_macro_Other 1.721 0.03
goal_Other 0.623 0.03
frequency_date 0.077 0.03
importance_same_religion 0.040 0.03
partner_liked_me 0.229 0.04
partner_found_me__interests 0.060 0.04
partner_career_macro_Journalism −1.347 0.04
career_macro_Social.Work −0.596 0.04
my_age 0.029 0.05
partner_career_macro_Real.Estate 1.255 0.05
choice_limited −0.429 0.05
happy_expec −0.057 0.06
opinion_duration_of_date_Too.little 0.188 0.06
race_Latino 0.352 0.07
you_perceive_yourself__attractive 0.078 0.08
you_look_for__sincere 1.770 0.08
you_look_for__intelligent 1.716 0.09
partner_perceives_himself__intelligent 0.089 0.09
opinion_duration_of_date_Too.much −0.538 0.09
career_macro_Creative.Arts.Entertainment 0.312 0.09
partner_perceives_himself__fun −0.059 0.11
partner_career_macro_Lawyer −0.311 0.12
partner_race_Others −0.314 0.12
partner_career_macro_Pro.sports.Athletics −2.273 0.12
(Intercept) 2.235 0.13
partner_found_me__intelligent −0.061 0.17
goal_Meet.new.people −0.244 0.18
career_macro_Engineer 0.390 0.19
career_macro_Banking.Consulting.Finance.Marketing.Business.CEO.Entrepreneur.Admin −0.168 0.20
partner_stated_pref_time0_intelligent −1.271 0.20
partner_perceives_himself__ambitious 0.040 0.22
partner_career_macro_Doctor.Medicine −0.248 0.22
career_macro_Doctor.Medicine 0.231 0.24
partner_career_macro_Social.Work 0.354 0.26
partner_perceives_himself__sincere −0.041 0.26
position_meeting −0.010 0.26
you_look_for__fun −1.132 0.27
race_Black −0.278 0.27
partner_found_me__ambitious −0.037 0.27
career_macro_Journalism −0.690 0.27
you_perceive_yourself__ambitious −0.033 0.30
interests_correlation 0.150 0.33
partner_career_macro_Engineer 0.265 0.38
i_found_partner__intelligent −0.043 0.38
partner_perceives_himself__attractive 0.036 0.40
career_macro_Psychologist −0.223 0.40
career_macro_Speech.Pathology 0.986 0.42
partner_career_macro_Architecture −0.810 0.42
career_macro_Undecided −0.187 0.43
partner_found_me__sincere −0.027 0.46
partner_stated_pref_time0_sincere 0.732 0.48
career_macro_Architecture −0.753 0.50
partner_career_macro_Banking.Consulting.Finance.Marketing.Business.CEO.Entrepreneur.Admin −0.078 0.55
race_Others −0.121 0.56
partner_race_Black −0.124 0.57
career_macro_Real.Estate −0.987 0.60
partner_career_macro_Other −0.298 0.60
career_macro_International.Humanitarian.Affairs 0.094 0.65
partner_stated_pref_time0_ambitious 0.479 0.71
partner_career_macro_International.Humanitarian.Affairs 0.063 0.74
goal_To.say −0.084 0.75
goal_Fun 0.054 0.76
partner_stated_pref_time0_fun 0.274 0.79
partner_stated_pref_time0_attractive −0.134 0.87
partner_career_macro_Undecided 0.035 0.88
met_before 0.027 0.90
you_perceive_yourself__intelligent −0.006 0.91
partner_found_me__fun −0.004 0.91
partner_race_White −0.014 0.91
partner_race_Latino −0.015 0.94
order_meeting 0.000 0.97
partner_age 0.001 0.97
partner_career_macro_Politics −0.022 0.97
partner_stated_pref_time0_shared_interests NA NA
you_look_for__shared_interests NA NA

Adicionando interação entre preditores

Um tipo de passo interessante que pode ser adicionado à recipe é a adição de novos preditores que representam interações entre os preditores originais.

É interessante poder usar as funções helpers, que tornam muito mais fácil a adição de vários termos de interação de uma só vez.

Aqui vamos usar uma regressão ao invés de classificação para facilitar a interpretação dos coeficientes.

dado_treino_regressao <- dado_treino %>% 
  mutate(
    i_liked_partner = if_else(i_liked_partner == "Liked", 1, 0)
  )


receita_com_interacao <- recipe(i_liked_partner ~ ., data = dado_treino_regressao) %>%
  update_role(
    match,
    unique_id_number,
    id_within_wave,
    subject_within_gender,
    partnet_id_within_wave,
    partner_unique_id_number,
    new_role = "ID"
  ) %>% 
  step_ordinalscore(
    frequency_date
  ) %>% 
  step_dummy(
    all_nominal(), -frequency_date, -all_outcomes() 
  ) %>%
  step_interact(
    terms = ~ starts_with("race")*starts_with("partner_race")*importance_same_race
  ) %>% 
  step_interact(
    terms = ~ starts_with("i_found_partner__")*starts_with("sex_")
  ) %>%  
  step_interact(
    terms = ~ starts_with("interests_correlation")*starts_with("sex_")
  ) %>% 
  step_interact(
    terms = ~ starts_with("partner_age")*starts_with("sex_")
  ) %>%  
  step_interact(
    terms = ~ starts_with("you_perceive_yourself__")*starts_with("sex_")
  ) %>% 
  step_interact(
    terms = ~ starts_with("partner_perceives_himself__")*starts_with("sex_")
  ) %>% 
  step_interact(
    terms = ~ starts_with("career_macro")*starts_with("partner_career_macro")
  ) %>% 
  step_zv(all_predictors()) 

Normalizando os preditores

Alguns modelos funcionam melhor com os preditores normalizados, por exemplo os que aplicam penalidades aos coeficientes que multiplicam os preditores. Com os preditores em uma faixa parecida, esta penalização é mais justa.

receita_com_interacao <- recipe(i_liked_partner ~ ., data = dado_treino_regressao) %>%
  update_role(
    match,
    unique_id_number,
    id_within_wave,
    subject_within_gender,
    partnet_id_within_wave,
    partner_unique_id_number,
    new_role = "ID"
  ) %>% 
  step_ordinalscore(
    frequency_date
  ) %>% 
  step_dummy(
    all_nominal(), -frequency_date, -all_outcomes() 
  ) %>%
  step_interact(
    terms = ~ starts_with("race")*starts_with("partner_race")*importance_same_race
  ) %>% 
  step_interact(
    terms = ~ starts_with("i_found_partner__")*starts_with("sex_")
  ) %>%  
  step_interact(
    terms = ~ starts_with("interests_correlation")*starts_with("sex_")
  ) %>% 
  step_interact(
    terms = ~ starts_with("partner_age")*starts_with("sex_")
  ) %>%  
  step_interact(
    terms = ~ starts_with("you_perceive_yourself__")*starts_with("sex_")
  ) %>% 
  step_interact(
    terms = ~ starts_with("partner_perceives_himself__")*starts_with("sex_")
  ) %>% 
  step_interact(
    terms = ~ starts_with("career_macro")*starts_with("partner_career_macro")
  ) %>% 
  step_zv(all_predictors()) %>% 
  step_center(all_numeric()) %>%
  step_scale(all_numeric())

Regressões lasso e ridge

Com a adição destes termos de interação, ficamos com muitos preditores, vários deles bem correlacionados.

O uso de termos muito correlacionados (não só em pares) pode levar ao fenômeno da colinearidade. O modelo pode atribuir efeito a uma ou outra variável de entrada de acordo com a amostra usada pra treinamento, ficando, portanto, com maior variância.

Uma forma de evitar que muitas variáveis sejam efetivamente usadas no modelo é aplicar uma penalidade de forma a diminuir o número de coeficientes acionados, e, por consequência a variância do modelo. É isso que as regressões do tipo Ridge e Lasso fazem.

Essas regressões que penalizam o número e o tamanho do efeito das relações entre as variáveis de entrada e os coeficientes trocam viés por variância e (no caso da lasso) interpretabilidade: elas têm mais viés, se adaptam menos ao conjunto de treinamento, mas não variam tanto dependendo de qual amostra da população foi escolhida para o treinamento. Além disso, por ter menos coeficientes “ligados” (no caso da lasso), é mais interpretável.

O modelo Elastic Net conjuga a penalização do tipo Ridge com a penalização do tipo Lasso modificando a função de penalização da regressão, que na regressão de mínimos quadrados ordinários, a mais comum, como o nome diz, é o erro quadrático:

\[RSS = \sum_{i = 1}^{n} ( y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij})^2 \]

Para a regressão Ridge, os coeficientes são penalizados de forma quadrática. Isso diminui a variância do modelo mas não diminui tantoo número de coeficientes diferentes de 0:

\[Loss_{Ridge} = RSS + \lambda \sum_{j=1}^{p}\beta_j^2 \]

Para a regressão Lasso, os coeficientes são penalizados pelo seu valor absoluto. Isso diminui a variância do modelo E diminui o número de coeficientes diferentes de 0, favorecendo a interpretabilidade

\[Loss_{Lasso} = RSS + \lambda \sum_{j=1}^{p} \left| \beta_j \right| \]

Utilizando os hiperparâmetros

O conceito de interface é importante na engenharia de software.

É sempre melhor depender de interfaces do que de implementações. A parnsip funciona como uma camada de abstração que oferece uma interface única que se encarrega de cuidar da chamada às diferentes implementações.

As interfaces genéricas da parnsip estão prontas para receber os parâmetros mais comuns usados nos engines e estão preparadas para passar ao engine o valor destes parâmetros.

No caso das regressões lineares, alguns engines como o glmnet estão preparados para receber os hiperparâmetros necessários para a implementação da regressão Elçastic Net lasso-ridge.

lr_mod <- 
  linear_reg(penalty = .02, mixture = 1) %>% 
  set_engine("glmnet")

wf_com_interacao <- workflow() %>% 
  add_recipe(receita_com_interacao) %>% 
  add_model(lr_mod) 

wf_com_interacao
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: linear_reg()
## 
## -- Preprocessor ----------------------------------------------------------------
## 12 Recipe Steps
## 
## * step_ordinalscore()
## * step_dummy()
## * step_interact()
## * step_interact()
## * step_interact()
## * step_interact()
## * step_interact()
## * step_interact()
## * step_interact()
## * step_zv()
## * ...
## * and 2 more steps.
## 
## -- Model -----------------------------------------------------------------------
## Linear Regression Model Specification (regression)
## 
## Main Arguments:
##   penalty = 0.02
##   mixture = 1
## 
## Computational engine: glmnet

Fit com Elastic Net

Na estimação com Elastic Net temos menos preditores acionados

fit_com_interacao <- 
  wf_com_interacao %>% 
  fit(
    data = dado_treino_regressao
  )

fit_com_interacao %>%  tidy() %>% 
  filter(
    estimate != 0
  ) %>% 
  arrange(
    estimate %>% abs() %>% desc()
  ) %>% 
  select(
    term,
    estimate
  ) %>% 
  gt() %>% 
  fmt_number(
    columns = one_of("estimate"),
    decimals = 3
  )
term estimate
i_found_partner__attractive 0.290
probability_i_find_partner_liked_me 0.147
i_found_partner__interests 0.113
i_found_partner__fun 0.100
i_found_partner__attractive_x_sex_Homem 0.081
you_look_for__attractive −0.055
partner_found_me__attractive −0.055
race_White −0.040
you_perceive_yourself__attractive −0.033
probability_partner_find_i_liked_partner 0.033
you_perceive_yourself__fun −0.029
you_perceive_yourself__sincere −0.029
career_macro_Politics −0.028
importance_same_race −0.021
goal_Other −0.021
you_look_for__shared_interests 0.021
partner_career_macro_Speech.Pathology −0.021
goal_Serious 0.021
i_found_partner__sincere −0.021
career_macro_Creative.Arts.Entertainment_x_partner_career_macro_Creative.Arts.Entertainment −0.020
career_macro_Pro.sports.Athletics 0.020
happy_expec 0.019
i_found_partner__ambitious −0.017
partner_liked_me −0.016
career_macro_Psychologist_x_partner_career_macro_Architecture 0.015
importance_same_religion −0.015
career_macro_Lawyer_x_partner_career_macro_Creative.Arts.Entertainment −0.014
race_Latino_x_importance_same_race −0.014
you_look_for__fun 0.012
career_macro_International.Humanitarian.Affairs_x_partner_career_macro_Creative.Arts.Entertainment −0.011
same_race 0.011
career_macro_Social.Work 0.011
career_macro_Banking.Consulting.Finance.Marketing.Business.CEO.Entrepreneur.Admin_x_partner_career_macro_Engineer −0.010
goal_Meet.new.people 0.009
opinion_duration_of_date_Too.much 0.009
career_macro_Banking.Consulting.Finance.Marketing.Business.CEO.Entrepreneur.Admin 0.008
partner_career_macro_Lawyer 0.008
career_macro_Banking.Consulting.Finance.Marketing.Business.CEO.Entrepreneur.Admin_x_partner_career_macro_Speech.Pathology 0.008
race_White_x_importance_same_race −0.008
partner_career_macro_Real.Estate −0.007
frequency_date −0.006
career_macro_Other_x_partner_career_macro_Lawyer −0.006
career_macro_Doctor.Medicine_x_partner_career_macro_International.Humanitarian.Affairs −0.006
partner_found_me__intelligent 0.006
career_macro_Other −0.006
career_macro_Engineer_x_partner_career_macro_Social.Work −0.005
career_macro_Undecided_x_partner_career_macro_Doctor.Medicine 0.005
partner_career_macro_Journalism 0.005
partner_race_Latino_x_importance_same_race −0.004
career_macro_Lawyer_x_partner_career_macro_Psychologist −0.004
partner_career_macro_Psychologist −0.003
career_macro_Creative.Arts.Entertainment −0.003
partner_found_me__sincere 0.003
career_macro_Lawyer_x_partner_career_macro_Banking.Consulting.Finance.Marketing.Business.CEO.Entrepreneur.Admin −0.003
career_macro_Creative.Arts.Entertainment_x_partner_career_macro_Speech.Pathology −0.002
career_macro_Engineer_x_partner_career_macro_Banking.Consulting.Finance.Marketing.Business.CEO.Entrepreneur.Admin 0.002
n_people_met_in_wave 0.002
you_look_for__ambitious −0.002
career_macro_Psychologist_x_partner_career_macro_Psychologist 0.002
career_macro_International.Humanitarian.Affairs_x_partner_career_macro_Engineer −0.001
career_macro_Social.Work_x_partner_career_macro_Other −0.001
career_macro_International.Humanitarian.Affairs_x_partner_career_macro_International.Humanitarian.Affairs −0.001
race_Others_x_partner_race_Latino 0.001
race_Latino_x_partner_race_Latino_x_importance_same_race −0.000
partner_stated_pref_time0_attractive 0.000
career_macro_Lawyer_x_partner_career_macro_Other 0.000
position_meeting 0.000
career_macro_Banking.Consulting.Finance.Marketing.Business.CEO.Entrepreneur.Admin_x_partner_career_macro_Banking.Consulting.Finance.Marketing.Business.CEO.Entrepreneur.Admin 0.000
career_macro_Journalism 0.000
partner_career_macro_Social.Work −0.000
(Intercept) −0.000

Mais tipos de pré-processamento

Aqui adicionamos mais alguns passos de pré-processamento:

receita_com_interacao_class <- recipe(i_liked_partner ~ ., data = dado_treino) %>%
  update_role(
    match,
    unique_id_number,
    id_within_wave,
    subject_within_gender,
    partnet_id_within_wave,
    partner_unique_id_number,
    new_role = "ID"
  ) %>% 
  step_ordinalscore(
    frequency_date
  ) %>% 
  step_dummy(
    all_nominal(), -frequency_date, -all_outcomes() 
  ) %>%
  step_mutate(
    diff_age = my_age - partner_age     
  ) %>% 
  step_poly(
    diff_age
  ) %>% 
  step_interact(
    terms = ~ starts_with("diff_age") * starts_with("sex")
  ) %>% 
  step_interact(
    terms = ~ starts_with("race")*starts_with("partner_race")*importance_same_race
  ) %>% 
  step_interact(
    terms = ~ starts_with("i_found_partner__")*starts_with("sex_")
  ) %>%  
  step_interact(
    terms = ~ starts_with("interests_correlation")*starts_with("sex_")
  ) %>%
  step_interact(
    terms = ~ starts_with("partner_age")*starts_with("sex_")
  ) %>%
  step_interact(
    terms = ~ starts_with("you_perceive_yourself__")*starts_with("sex_")
  ) %>%
  step_interact(
    terms = ~ starts_with("partner_perceives_himself__")*starts_with("sex_")
  ) %>%
  step_interact(
    terms = ~ starts_with("partner_found_me__")*starts_with("sex")
  ) %>%
  step_interact(
    terms = ~ starts_with("you_look_for__")*starts_with("sex")
  ) %>%
  step_interact(
    terms = ~ starts_with("you_look_for__")*starts_with("i_found_partner")
  ) %>%
  step_interact(
    terms = ~ starts_with("probability_partner_find_i_liked_partner")*starts_with("sex")
  ) %>%
  step_interact(
    terms = ~ starts_with("career_")*starts_with("sex")
  ) %>%
  step_interact(
    terms = ~ starts_with("partner_career_")*starts_with("sex")
  ) %>%
  step_interact(
    terms = ~ starts_with("goal")*starts_with("sex")
  ) %>%
  step_corr(
    all_predictors(), 
    threshold = 0.8  
  ) %>% 
  step_zv(all_predictors()) %>% 
  step_center(all_numeric()) %>%
  step_scale(all_numeric())
  
  
log_mod <- 
  logistic_reg() %>% 
  set_engine("glm")

wf_com_interacao_class <- workflow() %>% 
  add_recipe(receita_com_interacao_class) %>% 
  add_model(log_mod)


fit_com_interacao_class <- 
  wf_com_interacao_class %>% 
  fit(
    data = dado_treino
  )

Avaliando a curva ROC a partir da yardstick

A curva ROC (Receiver Operating Characteristics) foi inventada na época da Segunda Guerra Mundial para avaliar se os operadores de radar americanos estavam detectando confiavelmente aeronaves japonesas a partir de sinais de radar.

A curva mostra, para vários thresholds, qual a fração de verdadeiros positivos (ou sensibilidade) e a fração de falsos positivos (fall-out, ou \(1 - especificidade\) ).

Uma métrica numérica que traduz o a precisão geral de um modelo de classificação consiste na área embaixo desta curva (AUC, Area Under the Curve). Note que quanto mais perto de um essa área, menor a taxa de falsos positivos e maior a sensibilidade

Aqui executamos a predição nos próprios dados de treinamento e usamos função roc_curve() da yardstick para gerar os dados necessários para plotara curva

pred_like <- predict(
  object = fit_com_interacao_class,
  new_data = dado_teste,
  type = "prob"
) %>% 
  bind_cols(dado_teste %>%  select(i_liked_partner))


dados_roc <- pred_like %>% 
  roc_curve(
    truth = i_liked_partner,
    .pred_Liked
  )

dados_roc %>% 
  filter(
    row_number() %% 100 == 0
  ) %>% 
  gt() %>% 
  fmt_number(
    columns = everything(),
    decimals = 3
  )
.threshold specificity sensitivity
0.015 0.137 0.989
0.048 0.277 0.978
0.109 0.409 0.958
0.188 0.543 0.940
0.297 0.656 0.896
0.446 0.751 0.831
0.542 0.833 0.749
0.669 0.896 0.643
0.787 0.936 0.510
0.871 0.967 0.366
0.939 0.988 0.209
0.991 0.999 0.040

Plotamos, então, a curva

ponto_gatilho <- dados_roc %>% 
  filter(
    .threshold > 0.5
  ) %>% 
  slice_min(
    n = 1, order_by = .threshold
  )

dados_roc %>% ggplot(aes(x = 1 - specificity, y = sensitivity)) +
  geom_path() +
  geom_point(
    data = ponto_gatilho,
    aes(x = 1 - specificity, y = sensitivity),
    size = 3,
    color = "darkblue"
  ) +
  geom_text_repel(
    data = ponto_gatilho,
    aes(
      x = 1 - specificity + 0.15, 
      y = sensitivity - 0.15,
      label = str_glue("{sensitivity %>% percent(accuracy = 0.1)}/{(1 - specificity) %>%  percent(accuracy = 0.1)}")
    )
  ) +
  geom_abline(lty = 3) +
  coord_equal() +
  theme_bw()

Existe uma função que plota a curva automaticamente

dados_roc %>% 
  autoplot()

A função roc_auc calcula a área embaixo da curva

pred_like %>% 
  roc_auc(
    truth = i_liked_partner,
    .pred_Liked
  ) %>% 
  gt()
.metric .estimator .estimate
roc_auc binary 0.8686356

Rodando sem interação

receita_sem_interacao_class <- recipe(i_liked_partner ~ ., data = dado_treino) %>%
  update_role(
    match,
    unique_id_number,
    id_within_wave,
    subject_within_gender, 
    partnet_id_within_wave,
    partner_unique_id_number,
    new_role = "ID"
  ) %>% 
  step_ordinalscore(
    frequency_date
  ) %>% 
  step_dummy(
    all_nominal(), -frequency_date, -all_outcomes() 
  )

wf_sem_interacao_class <- workflow() %>% 
  add_recipe(receita_sem_interacao_class) %>% 
  add_model(log_mod)


fit_sem_interacao_class <- 
  wf_sem_interacao_class %>% 
  fit(
    data = dado_treino
  )

pred_like_sem <- predict(
  object = fit_sem_interacao_class,
  new_data = dado_teste,
  type = "prob"
) %>% 
  bind_cols(dado_teste %>%  select(i_liked_partner))



pred_like_sem %>% 
  roc_curve(
    truth = i_liked_partner,
    .pred_Liked
  ) %>% 
  autoplot()

pred_like_sem %>% 
  roc_auc(
    truth = i_liked_partner,
    .pred_Liked
  ) %>% 
  gt()
.metric .estimator .estimate
roc_auc binary 0.8669198

Comparando as curvas

As curvas ficaram muito parecidas.

Nao adiantou muito a adição de interações, mas isso pode ser causado pela simplicidade do modelo

rocs <- bind_rows(
  roc_curve(pred_like_sem, i_liked_partner, .pred_Liked) %>% mutate(tipo = "Sem interação"),
  roc_curve(pred_like, i_liked_partner, .pred_Liked) %>% mutate(tipo = "Com interação"),
)
  




rocs %>%
  ggplot(aes(x = 1 - specificity, y = sensitivity, color = tipo)) +
  geom_path() +
  geom_abline(lty = 3) +
  coord_equal() +
  theme_bw() +
  theme(
    legend.position = "top"
  )

Implementando cross validation

A biblioteca rsamples, que já vimos, oferece a infraestrutura para geração das amostras necessárias para o Cross Validation.

Relembrando, temos que montar o seguinte esquema:

No código abaixo, fazemos 2 divisões em 5 partes. Ficaremos, portanto, com 10 particionamentos entre treinamento e validação.

folds <- vfold_cv(dado_treino, v = 5, repeats = 2, strata = i_liked_partner)

folds
## #  5-fold cross-validation repeated 2 times using stratification 
## # A tibble: 10 x 3
##    splits             id      id2  
##    <list>             <chr>   <chr>
##  1 <split [2.9K/734]> Repeat1 Fold1
##  2 <split [2.9K/734]> Repeat1 Fold2
##  3 <split [2.9K/732]> Repeat1 Fold3
##  4 <split [2.9K/732]> Repeat1 Fold4
##  5 <split [2.9K/732]> Repeat1 Fold5
##  6 <split [2.9K/734]> Repeat2 Fold1
##  7 <split [2.9K/734]> Repeat2 Fold2
##  8 <split [2.9K/732]> Repeat2 Fold3
##  9 <split [2.9K/732]> Repeat2 Fold4
## 10 <split [2.9K/732]> Repeat2 Fold5
analysis(x = folds$splits[[1]] ) %>%  glimpse()
## Rows: 2,930
## Columns: 64
## $ sex                                        <fct> Mulher, Mulher, Mulher, ...
## $ choice                                     <fct> limited, limited, limite...
## $ partner_race                               <fct> Asian, White, White, Whi...
## $ frequency_date                             <ord> Once a month, Once a mon...
## $ career_macro                               <fct> Lawyer, Lawyer, Lawyer, ...
## $ opinion_duration_of_date                   <fct> Just Right, Just Right, ...
## $ race                                       <fct> White, White, White, Whi...
## $ goal                                       <fct> Fun, Fun, Fun, Fun, Fun,...
## $ match                                      <dbl> 0, 1, 0, 0, 0, 1, 0, 0, ...
## $ same_race                                  <dbl> 0, 1, 1, 1, 1, 1, 1, 1, ...
## $ partner_liked_me                           <dbl> 1, 1, 1, 1, 0, 1, 0, 0, ...
## $ i_liked_partner                            <fct> Not, Liked, Not, Not, No...
## $ met_before                                 <dbl> 0, 1, 0, 0, 0, 0, 0, 0, ...
## $ unique_id_number                           <dbl> 4, 4, 4, 4, 4, 4, 4, 5, ...
## $ id_within_wave                             <dbl> 4, 4, 4, 4, 4, 4, 4, 5, ...
## $ subject_within_gender                      <dbl> 7, 7, 7, 7, 7, 7, 7, 9, ...
## $ n_people_met_in_wave                       <dbl> 10, 10, 10, 10, 10, 10, ...
## $ position_meeting                           <dbl> 6, 6, 6, 6, 6, 6, 6, 4, ...
## $ order_meeting                              <dbl> 9, 4, 5, 10, 1, 7, 8, 1,...
## $ partnet_id_within_wave                     <dbl> 3, 4, 6, 7, 8, 9, 10, 1,...
## $ partner_unique_id_number                   <dbl> 13, 14, 16, 17, 18, 19, ...
## $ interests_correlation                      <dbl> 0.05, -0.18, 0.37, 0.35,...
## $ my_age                                     <dbl> 23, 23, 23, 23, 23, 23, ...
## $ partner_age                                <dbl> 22, 23, 25, 30, 27, 28, ...
## $ partner_stated_pref_time0_attractive       <dbl> 0.1900000, 0.3000000, 0....
## $ partner_stated_pref_time0_sincere          <dbl> 0.1800000, 0.0500000, 0....
## $ partner_stated_pref_time0_intelligent      <dbl> 0.1900000, 0.1500000, 0....
## $ partner_stated_pref_time0_fun              <dbl> 0.1800000, 0.4000000, 0....
## $ partner_stated_pref_time0_ambitious        <dbl> 0.1400000, 0.0500000, 0....
## $ partner_stated_pref_time0_shared_interests <dbl> 0.1200000, 0.0500000, 0....
## $ importance_same_race                       <dbl> 1, 1, 1, 1, 1, 1, 1, 8, ...
## $ importance_same_religion                   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ happy_expec                                <dbl> 1, 1, 1, 1, 1, 1, 1, 7, ...
## $ i_found_partner__attractive                <dbl> 4, 8, 5, 7, 5, 9, 8, 5, ...
## $ i_found_partner__sincere                   <dbl> 7, 10, 10, 10, 9, 8, 9, ...
## $ i_found_partner__intelligent               <dbl> 8, 7, 8, 10, 9, 10, 10, ...
## $ i_found_partner__fun                       <dbl> 8, 10, 4, 7, 5, 10, 10, ...
## $ i_found_partner__ambitious                 <dbl> 6, 7, 8, 10, 9, 7, 8, 2,...
## $ i_found_partner__interests                 <dbl> 7, 10, 2, 5, 7, 8, 8, 2,...
## $ partner_found_me__attractive               <dbl> 10, 7, 6, 7, 6, 7, 7, 6,...
## $ partner_found_me__sincere                  <dbl> 10, 7, 6, 6, 7, 7, 8, 8,...
## $ partner_found_me__intelligent              <dbl> 10, 7, 7, 3, 8, 7, 8, 8,...
## $ partner_found_me__fun                      <dbl> 10, 9, 7, 5, 6, 10, 7, 8...
## $ partner_found_me__ambitious                <dbl> 10, 9, 8, 6, 6, 9, 8, 7,...
## $ partner_found_me__interests                <dbl> 10, 9, 7, 5, 5, 10, 7, 6...
## $ probability_i_find_partner_liked_me        <dbl> 1, 10, 3, 1, 6, 8, 8, 5,...
## $ you_look_for__attractive                   <dbl> 0.20, 0.20, 0.20, 0.20, ...
## $ you_look_for__sincere                      <dbl> 0.20, 0.20, 0.20, 0.20, ...
## $ you_look_for__intelligent                  <dbl> 0.20, 0.20, 0.20, 0.20, ...
## $ you_look_for__fun                          <dbl> 0.20, 0.20, 0.20, 0.20, ...
## $ you_look_for__ambitious                    <dbl> 0.10, 0.10, 0.10, 0.10, ...
## $ you_look_for__shared_interests             <dbl> 0.10, 0.10, 0.10, 0.10, ...
## $ you_perceive_yourself__attractive          <dbl> 7, 7, 7, 7, 7, 7, 7, 6, ...
## $ you_perceive_yourself__sincere             <dbl> 8, 8, 8, 8, 8, 8, 8, 3, ...
## $ you_perceive_yourself__fun                 <dbl> 9, 9, 9, 9, 9, 9, 9, 6, ...
## $ you_perceive_yourself__intelligent         <dbl> 7, 7, 7, 7, 7, 7, 7, 10,...
## $ you_perceive_yourself__ambitious           <dbl> 8, 8, 8, 8, 8, 8, 8, 8, ...
## $ probability_partner_find_i_liked_partner   <dbl> 10, 10, 6, 4, 7, 8, 7, 5...
## $ partner_career_macro                       <fct> Lawyer, Lawyer, Banking/...
## $ partner_perceives_himself__attractive      <dbl> 4, 9, 6, 7, 6, 10, 7, 8,...
## $ partner_perceives_himself__sincere         <dbl> 7, 9, 6, 7, 8, 6, 7, 9, ...
## $ partner_perceives_himself__fun             <dbl> 8, 9, 8, 6, 6, 10, 10, 7...
## $ partner_perceives_himself__intelligent     <dbl> 8, 9, 8, 8, 8, 10, 10, 8...
## $ partner_perceives_himself__ambitious       <dbl> 3, 9, 6, 4, 9, 10, 10, 5...

Executando cross validation

Executar o cross validation, como vimos, significa estimar o modelo várias vezes em execuções que são completamente independentes, o que é perfeito para executar em paralelo.

A função fit_resamples() faz todas as estimações.

É possível registrar um backend para fazer a execução de forma paralela

all_cores <- parallel::detectCores(logical = FALSE)

library(doParallel)
cl <- makePSOCKcluster(all_cores)
registerDoParallel(cl)

log_mod <- 
  logistic_reg() %>% 
  set_engine("glm")

wf_com_interacao_class <- workflow() %>% 
  add_recipe(receita_com_interacao_class) %>% 
  add_model(log_mod)

fit_com_interacao_resample <- 
  wf_com_interacao_class %>% 
  fit_resamples(
    folds,
    control = control_resamples(
      allow_par = TRUE 
    )
  )

stopCluster(cl)

fit_com_interacao_resample
## # Resampling results
## # 5-fold cross-validation repeated 2 times using stratification 
## # A tibble: 10 x 5
##    splits             id      id2   .metrics         .notes          
##    <list>             <chr>   <chr> <list>           <list>          
##  1 <split [2.9K/734]> Repeat1 Fold1 <tibble [2 x 4]> <tibble [3 x 1]>
##  2 <split [2.9K/734]> Repeat1 Fold2 <tibble [2 x 4]> <tibble [2 x 1]>
##  3 <split [2.9K/732]> Repeat1 Fold3 <tibble [2 x 4]> <tibble [3 x 1]>
##  4 <split [2.9K/732]> Repeat1 Fold4 <tibble [2 x 4]> <tibble [2 x 1]>
##  5 <split [2.9K/732]> Repeat1 Fold5 <tibble [2 x 4]> <tibble [3 x 1]>
##  6 <split [2.9K/734]> Repeat2 Fold1 <tibble [2 x 4]> <tibble [3 x 1]>
##  7 <split [2.9K/734]> Repeat2 Fold2 <tibble [2 x 4]> <tibble [3 x 1]>
##  8 <split [2.9K/732]> Repeat2 Fold3 <tibble [2 x 4]> <tibble [2 x 1]>
##  9 <split [2.9K/732]> Repeat2 Fold4 <tibble [2 x 4]> <tibble [3 x 1]>
## 10 <split [2.9K/732]> Repeat2 Fold5 <tibble [2 x 4]> <tibble [3 x 1]>

A função collect_metrics() recebe o objeto retornado por fit_resamples() e retorna uma estrutura com os resultados das execuções, no nosso caso 10. Como não há hiperparâmetros para variar, há apenas duas linhas, com duas métricas relativas ao mesmo conjunto único de hiperparâmetros.

collect_metrics(fit_com_interacao_resample) %>% 
  gt() %>% 
  fmt_percent(
    columns = vars(mean, std_err)
  )
.metric .estimator mean n std_err .config
accuracy binary 76.69% 10 0.28% Preprocessor1_Model1
roc_auc binary 84.66% 10 0.33% Preprocessor1_Model1

Fazendo o tuning nos hiperparâmetros

O modelo de regressão logística que rodamos anteriormente não tinha hiperparâmetros, mas ao rodar modelos como Elastic Net (lasso-ridge) gostaríamos de avaliar qual o melhor conjunto de hiperparâmetros nos dados de validação.

A biblioteca tune nos ajuda a fazer esse procedimento.

No momento de definir o modelo, atribuimos a resposta de tune() aos parâmetros do modelo que queremos variar para fins de tuning

tune_spec_logistic_reg <- logistic_reg(
  penalty = tune(),
  mixture = tune()
) %>% 
  set_engine(
    engine = "glmnet"
  ) %>% 
  set_mode(
    "classification"
  )


tune_spec_logistic_reg
## Logistic Regression Model Specification (classification)
## 
## Main Arguments:
##   penalty = tune()
##   mixture = tune()
## 
## Computational engine: glmnet

Definindo os valores a serem testados

Uma das formas de executar esta busca pelo melhor conjunto de hiperparâmetros é criar um grid com várias especificações.

A biblioteca dials ajuda na criação deste grid de valores pros hiperparâmetros.

A função grid_regular() cria um grid com intervalos regulares (esses intervalos regulares, dependendo do parâmetro podem ser regulares em log, por exemplo).

Existem funções com os mesmos nomes usados para os parâmetros na interface genérica da parnsnip. Quando chamadas sem parâmetros, essas funções geram valores que normalmente fazem sentido, mas é possível escolher os valores de forma personalizada.

O parâmetro levels define quantos valores diferentes serão usados para cada parâmetro.

net_grid <- grid_regular(
  penalty(range = c(-4,-1)),
  mixture(),
  levels = c(
    penalty = 10,
    mixture = 10
  )
)

net_grid
## # A tibble: 100 x 2
##     penalty mixture
##       <dbl>   <dbl>
##  1 0.0001         0
##  2 0.000215       0
##  3 0.000464       0
##  4 0.001          0
##  5 0.00215        0
##  6 0.00464        0
##  7 0.01           0
##  8 0.0215         0
##  9 0.0464         0
## 10 0.1            0
## # ... with 90 more rows

Rodando o tuning

A função tunegrid() roda a busca dentro deste grid de valores de parâmetros rodando o processo de cross validation de acordo com o que for passado para o parâmetro resample.

all_cores <- parallel::detectCores(logical = FALSE)

library(doParallel)
cl <- makePSOCKcluster(all_cores)
registerDoParallel(cl)


wf_logreg_tune_sample <- workflow() %>% 
  add_model(
    tune_spec_logistic_reg
  ) %>% 
  add_recipe(receita_com_interacao_class)


res_logreg_tune_sample_optim <- wf_logreg_tune_sample %>% 
  tune_grid(
    resamples = folds,
    grid = net_grid,
    control = control_grid(allow_par = TRUE)
  )  

stopCluster(cl)

res_logreg_tune_sample_optim
## # Tuning results
## # 5-fold cross-validation repeated 2 times using stratification 
## # A tibble: 10 x 5
##    splits             id      id2   .metrics           .notes          
##    <list>             <chr>   <chr> <list>             <list>          
##  1 <split [2.9K/734]> Repeat1 Fold1 <tibble [200 x 6]> <tibble [1 x 1]>
##  2 <split [2.9K/734]> Repeat1 Fold2 <tibble [200 x 6]> <tibble [1 x 1]>
##  3 <split [2.9K/732]> Repeat1 Fold3 <tibble [200 x 6]> <tibble [1 x 1]>
##  4 <split [2.9K/732]> Repeat1 Fold4 <tibble [200 x 6]> <tibble [1 x 1]>
##  5 <split [2.9K/732]> Repeat1 Fold5 <tibble [200 x 6]> <tibble [1 x 1]>
##  6 <split [2.9K/734]> Repeat2 Fold1 <tibble [200 x 6]> <tibble [1 x 1]>
##  7 <split [2.9K/734]> Repeat2 Fold2 <tibble [200 x 6]> <tibble [1 x 1]>
##  8 <split [2.9K/732]> Repeat2 Fold3 <tibble [200 x 6]> <tibble [1 x 1]>
##  9 <split [2.9K/732]> Repeat2 Fold4 <tibble [200 x 6]> <tibble [1 x 1]>
## 10 <split [2.9K/732]> Repeat2 Fold5 <tibble [200 x 6]> <tibble [1 x 1]>

Visualizando os resultados

Uma forma de visualizar os resultados é com geom_tile, após rodar a função collect_metrics()

plot_result_tune <- function(results){
  results %>% 
    collect_metrics() %>%
    filter(
      .metric == "roc_auc"
    ) %>% 
    mutate(
      ranque = rank(mean)
    ) %>%     
    ggplot() +
    geom_tile(
      aes(
        x = penalty,
        y = mixture,
        fill = ranque
      )
    ) +
    geom_shadowtext(
      aes(
        x = penalty,
        y = mixture,
        label = percent(mean, accuracy = .01),
      ),
      size = 3,
      color = "white",
      bg.colour="black",
      
    ) +
    scale_x_log10() +
    scale_fill_gradient(low = "white", high = "darkgreen") +
    theme_minimal() +
    theme(
      legend.position = "top"
    )
}


plot_result_tune(res_logreg_tune_sample_optim)

res_logreg_tune_sample_optim %>% 
  collect_metrics() %>% 
  filter(
    .metric == "roc_auc"
  ) %>% 
  arrange(mean %>%  desc()) %>% 
  gt() %>% 
  fmt_percent(
    columns = vars(mean, std_err)
  ) %>% 
  fmt_number(
    columns = vars(penalty, mixture),
    n_sigfig = 2
  )
penalty mixture .metric .estimator mean n std_err .config
0.010 0.33 roc_auc binary 85.32% 10 0.30% Preprocessor1_Model037
0.0046 0.67 roc_auc binary 85.30% 10 0.31% Preprocessor1_Model066
0.0046 0.78 roc_auc binary 85.30% 10 0.31% Preprocessor1_Model076
0.010 0.44 roc_auc binary 85.29% 10 0.29% Preprocessor1_Model047
0.010 0.22 roc_auc binary 85.29% 10 0.31% Preprocessor1_Model027
0.0046 0.56 roc_auc binary 85.29% 10 0.32% Preprocessor1_Model056
0.0046 0.89 roc_auc binary 85.28% 10 0.30% Preprocessor1_Model086
0.0022 1.0 roc_auc binary 85.28% 10 0.33% Preprocessor1_Model095
0.0046 0.44 roc_auc binary 85.26% 10 0.32% Preprocessor1_Model046
0.0046 1.0 roc_auc binary 85.25% 10 0.30% Preprocessor1_Model096
0.0022 0.89 roc_auc binary 85.25% 10 0.33% Preprocessor1_Model085
0.022 0.11 roc_auc binary 85.24% 10 0.29% Preprocessor1_Model018
0.0022 0.78 roc_auc binary 85.23% 10 0.33% Preprocessor1_Model075
0.022 0.22 roc_auc binary 85.21% 10 0.28% Preprocessor1_Model028
0.010 0.56 roc_auc binary 85.21% 10 0.28% Preprocessor1_Model057
0.0046 0.33 roc_auc binary 85.21% 10 0.32% Preprocessor1_Model036
0.0022 0.67 roc_auc binary 85.19% 10 0.33% Preprocessor1_Model065
0.010 0.11 roc_auc binary 85.18% 10 0.31% Preprocessor1_Model017
0.0010 1.0 roc_auc binary 85.16% 10 0.33% Preprocessor1_Model094
0.0022 0.56 roc_auc binary 85.15% 10 0.33% Preprocessor1_Model055
0.0010 0.89 roc_auc binary 85.14% 10 0.32% Preprocessor1_Model084
0.0046 0.22 roc_auc binary 85.13% 10 0.32% Preprocessor1_Model026
0.0010 0.78 roc_auc binary 85.13% 10 0.32% Preprocessor1_Model074
0.010 0.67 roc_auc binary 85.12% 10 0.28% Preprocessor1_Model067
0.0022 0.44 roc_auc binary 85.12% 10 0.32% Preprocessor1_Model045
0.0010 0.67 roc_auc binary 85.11% 10 0.31% Preprocessor1_Model064
0.0022 0.33 roc_auc binary 85.09% 10 0.32% Preprocessor1_Model035
0.0010 0.56 roc_auc binary 85.09% 10 0.31% Preprocessor1_Model054
0.0046 0.11 roc_auc binary 85.07% 10 0.32% Preprocessor1_Model016
0.00010 0 roc_auc binary 85.07% 10 0.30% Preprocessor1_Model001
0.00022 0 roc_auc binary 85.07% 10 0.30% Preprocessor1_Model002
0.00046 0 roc_auc binary 85.07% 10 0.30% Preprocessor1_Model003
0.0010 0 roc_auc binary 85.07% 10 0.30% Preprocessor1_Model004
0.0022 0 roc_auc binary 85.07% 10 0.30% Preprocessor1_Model005
0.0046 0 roc_auc binary 85.07% 10 0.30% Preprocessor1_Model006
0.010 0 roc_auc binary 85.07% 10 0.30% Preprocessor1_Model007
0.022 0 roc_auc binary 85.07% 10 0.30% Preprocessor1_Model008
0.0022 0.22 roc_auc binary 85.06% 10 0.31% Preprocessor1_Model025
0.0010 0.44 roc_auc binary 85.06% 10 0.31% Preprocessor1_Model044
0.046 0.11 roc_auc binary 85.04% 10 0.26% Preprocessor1_Model019
0.022 0.33 roc_auc binary 85.04% 10 0.26% Preprocessor1_Model038
0.046 0 roc_auc binary 85.04% 10 0.29% Preprocessor1_Model009
0.00046 1.0 roc_auc binary 85.03% 10 0.30% Preprocessor1_Model093
0.0010 0.33 roc_auc binary 85.02% 10 0.30% Preprocessor1_Model034
0.010 0.78 roc_auc binary 85.01% 10 0.27% Preprocessor1_Model077
0.0022 0.11 roc_auc binary 85.01% 10 0.31% Preprocessor1_Model015
0.00046 0.89 roc_auc binary 85.01% 10 0.30% Preprocessor1_Model083
0.00046 0.78 roc_auc binary 85.00% 10 0.30% Preprocessor1_Model073
0.0010 0.22 roc_auc binary 84.99% 10 0.30% Preprocessor1_Model024
0.00046 0.67 roc_auc binary 84.98% 10 0.30% Preprocessor1_Model063
0.00046 0.56 roc_auc binary 84.96% 10 0.30% Preprocessor1_Model053
0.00046 0.44 roc_auc binary 84.94% 10 0.30% Preprocessor1_Model043
0.0010 0.11 roc_auc binary 84.93% 10 0.31% Preprocessor1_Model014
0.00022 1.0 roc_auc binary 84.91% 10 0.30% Preprocessor1_Model092
0.00046 0.33 roc_auc binary 84.91% 10 0.30% Preprocessor1_Model033
0.00022 0.89 roc_auc binary 84.90% 10 0.30% Preprocessor1_Model082
0.010 0.89 roc_auc binary 84.90% 10 0.26% Preprocessor1_Model087
0.00022 0.78 roc_auc binary 84.89% 10 0.30% Preprocessor1_Model072
0.00046 0.22 roc_auc binary 84.88% 10 0.31% Preprocessor1_Model023
0.00022 0.67 roc_auc binary 84.87% 10 0.30% Preprocessor1_Model062
0.00022 0.56 roc_auc binary 84.85% 10 0.30% Preprocessor1_Model052
0.00046 0.11 roc_auc binary 84.85% 10 0.31% Preprocessor1_Model013
0.10 0 roc_auc binary 84.85% 10 0.28% Preprocessor1_Model010
0.00010 1.0 roc_auc binary 84.84% 10 0.31% Preprocessor1_Model091
0.00022 0.44 roc_auc binary 84.84% 10 0.30% Preprocessor1_Model042
0.00022 0.33 roc_auc binary 84.83% 10 0.30% Preprocessor1_Model032
0.022 0.44 roc_auc binary 84.82% 10 0.25% Preprocessor1_Model048
0.00022 0.22 roc_auc binary 84.81% 10 0.31% Preprocessor1_Model022
0.00010 0.89 roc_auc binary 84.81% 10 0.30% Preprocessor1_Model081
0.00010 0.78 roc_auc binary 84.81% 10 0.30% Preprocessor1_Model071
0.00010 0.11 roc_auc binary 84.80% 10 0.31% Preprocessor1_Model011
0.00022 0.11 roc_auc binary 84.80% 10 0.31% Preprocessor1_Model012
0.00010 0.67 roc_auc binary 84.79% 10 0.31% Preprocessor1_Model061
0.00010 0.56 roc_auc binary 84.79% 10 0.31% Preprocessor1_Model051
0.00010 0.44 roc_auc binary 84.78% 10 0.31% Preprocessor1_Model041
0.010 1.0 roc_auc binary 84.78% 10 0.26% Preprocessor1_Model097
0.00010 0.33 roc_auc binary 84.77% 10 0.31% Preprocessor1_Model031
0.00010 0.22 roc_auc binary 84.77% 10 0.31% Preprocessor1_Model021
0.046 0.22 roc_auc binary 84.61% 10 0.24% Preprocessor1_Model029
0.022 0.56 roc_auc binary 84.56% 10 0.23% Preprocessor1_Model058
0.10 0.11 roc_auc binary 84.33% 10 0.23% Preprocessor1_Model020
0.022 0.67 roc_auc binary 84.33% 10 0.22% Preprocessor1_Model068
0.046 0.33 roc_auc binary 84.20% 10 0.21% Preprocessor1_Model039
0.022 0.78 roc_auc binary 84.10% 10 0.21% Preprocessor1_Model078
0.022 0.89 roc_auc binary 83.85% 10 0.20% Preprocessor1_Model088
0.046 0.44 roc_auc binary 83.80% 10 0.19% Preprocessor1_Model049
0.022 1.0 roc_auc binary 83.61% 10 0.19% Preprocessor1_Model098
0.10 0.22 roc_auc binary 83.58% 10 0.21% Preprocessor1_Model030
0.046 0.56 roc_auc binary 83.41% 10 0.18% Preprocessor1_Model059
0.046 0.67 roc_auc binary 83.05% 10 0.17% Preprocessor1_Model069
0.10 0.33 roc_auc binary 82.99% 10 0.19% Preprocessor1_Model040
0.046 0.78 roc_auc binary 82.67% 10 0.17% Preprocessor1_Model079
0.10 0.44 roc_auc binary 82.57% 10 0.18% Preprocessor1_Model050
0.046 0.89 roc_auc binary 82.32% 10 0.18% Preprocessor1_Model089
0.10 0.56 roc_auc binary 82.22% 10 0.19% Preprocessor1_Model060
0.046 1.0 roc_auc binary 82.13% 10 0.19% Preprocessor1_Model099
0.10 0.67 roc_auc binary 81.83% 10 0.22% Preprocessor1_Model070
0.10 0.78 roc_auc binary 81.57% 10 0.23% Preprocessor1_Model080
0.10 0.89 roc_auc binary 81.29% 10 0.24% Preprocessor1_Model090
0.10 1.0 roc_auc binary 80.90% 10 0.23% Preprocessor1_Model100

Árvore de decisão

A árvore de decisão particiona o espaço formado pelas variáveis explicativas em subespaços baseando-se na “pureza” desses subespaços com relação à variável dependente.

Abaixo fazemos uma experiência com apenas 2 features contínuas e uma feature categórica com dois valores possíveis .

receita_arvore_decisao_demo <- recipe(
  i_liked_partner ~
    i_found_partner__attractive +
    sex +
    my_age,
   data = dado_treino
)

arvore_decisao_mod <- 
  decision_tree(
    tree_depth = 4,
    min_n = 1,
    cost_complexity = 0
    
  ) %>% 
  set_engine("rpart") %>% 
  set_mode("classification")

wf_arvore_decisao_demo <- workflow() %>% 
  add_recipe(receita_arvore_decisao_demo) %>% 
  add_model(arvore_decisao_mod)

fit_ad_demo <- 
  wf_arvore_decisao_demo %>% 
  fit(
    data = dado_treino
  )

O algoritmo cira uma árvore de decisão como essa.

Ele escolhe, portanto, a ordem e os valores dos atributos que fatiarão a população. Seguindo a árvore até suas folhas, que são os nós sem filhos, podemos determinar a saída prevista de cada valor do vetor x.

fit_ad_demo$fit$fit$fit %>% rpart.plot() 

Temos um espaço formado por duas features contínuas e uma feature categórica com duas categorias, podemos pensar este espaço como dois planos.

O algoritmo da árvore de decisão escolhe qual feature divide cada plano em duas partes mais puras no sentido da classificação (com mais pontos com a mesma categoria).

Como temos duas features contínuas e mais uma com duas categorias possíveis, isso é equivalente a dividir o espaço representado por dois planos com retas na vertical e horizontal.

valores_partner_atractive = tibble(i_found_partner__attractive = seq(0, 10, by = 0.1))

valores_my_age = 
  tibble(
    my_age = seq(min(dado_treino$my_age), max(dado_treino$my_age), by = 1 ) 
  )

valores_sex = tibble( sex = c("Homem", "Mulher")  )



dados_novos <- crossing(
  valores_partner_atractive,
  valores_my_age,
  valores_sex
)

predicoes_arvore <- predict(
  object = fit_ad_demo,
  new_data = dados_novos
)


predicoes_arvore_com_dados <- bind_cols(
  dados_novos,
  predicoes_arvore
) 


ggplot(predicoes_arvore_com_dados) +
  geom_tile(
    aes(
      x = i_found_partner__attractive,
      y = my_age,
      fill = .pred_class,
      alpha = 0.1
    )
  ) +
  geom_jitter(
    data = dado_treino,
    width = 0.5,
    aes(
      x = i_found_partner__attractive,
      y = my_age,
      color = i_liked_partner
    ),
    size = 0.8,
    show.legend = FALSE
  ) +
  geom_vline(
    xintercept = 6.75
  ) +
  geom_vline(
    xintercept = 7.25
  ) +
  geom_hline(
    data = tibble(
      my_age = 20.5,
      sex = "Homem"
    ), 
    aes(
      yintercept = my_age
    )
  ) +
  geom_hline(
    data = tibble(
      my_age = 28.5,
      sex = "Mulher"
    ), 
    aes(
      yintercept = my_age
    )
  ) +
  facet_wrap(
    ~sex
  ) +
  guides(
    alpha = FALSE
  ) +
  theme_minimal() +
  scale_fill_manual(
    values = c(Liked = "darkgreen", Not = "darkred")
  ) +
  scale_color_manual(
    values = c(Liked = "darkgreen", Not = "darkred")
  ) +
  theme(
    legend.position = "top"
  ) +
  labs(
    fill = "",
    x = "Achei bonito",
    y = "Minha idade"
  )

log_reg_mod <- 
  logistic_reg(
  ) %>% 
  set_engine("glm") %>% 
  set_mode("classification")

wf_log_reg_demo <- workflow() %>% 
  add_recipe(receita_arvore_decisao_demo) %>% 
  add_model(log_reg_mod)

fit_log_reg_demo <- 
  wf_log_reg_demo %>% 
  fit(
    data = dado_treino
  )

A título de comparação, podemos ver o mesmo esquema com a regressão logística

predicoes_log_reg <- predict(
  object = fit_log_reg_demo,
  new_data = dados_novos
)


predicoes_log_reg_com_dados <- bind_cols(
  dados_novos,
  predicoes_log_reg
) 



ggplot(predicoes_log_reg_com_dados) +
  geom_tile(
    aes(
      x = i_found_partner__attractive,
      y = my_age,
      fill = .pred_class,
      alpha = 0.1
    )
  ) +
  geom_jitter(
    data = dado_treino,
    width = 0.5,
    aes(
      x = i_found_partner__attractive,
      y = my_age,
      color = i_liked_partner
    ),
    size = 0.8,
    show.legend = FALSE
  ) +
  facet_wrap(
    ~sex
  ) +
  guides(
    alpha = FALSE
  ) +
  theme_minimal() +
  scale_fill_manual(
    values = c(Liked = "darkgreen", Not = "darkred")
  ) +
  scale_color_manual(
    values = c(Liked = "darkgreen", Not = "darkred")
  ) +
  theme(
    legend.position = "top"
  ) +
  labs(
    fill = "",
    x = "Achei bonito",
    y = "Minha idade"
  )

Random Forest

A forma como a árvore de decisão é criada faz com que ela tenha muita variância.

Cada decisão de particionamento é tomada a partir de características que podem ser muito específicas aos dados de treinamento.

Uma ideia usada nas Random Forests é criar \(trees\) conjuntos de treinamento a partir do conjunto original, mas retirando amostras de mesmo tamanho do conjunto original, com reposição. Além disso, cada vez que a árvore é particionada, a partição só pode acontecer em \(mtry\) das variáveis explicativas.

O resultado final é uma média da decisão dessas \(trees\) árvores.

Estas duas mudanças fazem com que o modelo tenha uma variância muito menor do que as árvore de decisão simples.

ranger_mod <- 
  rand_forest(
    mtry = 2,
    trees = 100,
    min_n = 1
  ) %>% 
  set_engine("ranger") %>% 
  set_mode("classification")

wf_ranger_demo <- workflow() %>% 
  add_recipe(receita_arvore_decisao_demo) %>% 
  add_model(ranger_mod)

fit_ranger_demo <- 
  wf_ranger_demo %>% 
  fit(
    data = dado_treino
  )

Random Forest: maior flexibilidade

Abaixo podemos ver que o modelo oferece muito mais flexibilidade que as árvores de decisão simples.

predicoes_ranger <- predict(
  object = fit_ranger_demo,
  new_data = dados_novos
)


predicoes_ranger_com_dados <- bind_cols(
  dados_novos,
  predicoes_ranger
) 



ggplot(predicoes_ranger_com_dados) +
  geom_tile(
    aes(
      x = i_found_partner__attractive,
      y = my_age,
      fill = .pred_class,
      alpha = 0.1
    )
  ) +
  geom_jitter(
    data = dado_treino,
    width = 0.5,
    aes(
      x = i_found_partner__attractive,
      y = my_age,
      color = i_liked_partner
    ),
    size = 0.8,
    show.legend = FALSE
  ) +
  facet_wrap(
    ~sex
  ) +
  guides(
    alpha = FALSE
  ) +
  theme_minimal() +
  scale_fill_manual(
    values = c(Liked = "darkgreen", Not = "darkred")
  ) +
  scale_color_manual(
    values = c(Liked = "darkgreen", Not = "darkred")
  ) +
  theme(
    legend.position = "top"
  ) +
  labs(
    fill = "",
    x = "Achei bonito",
    y = "Minha idade"
  )

Random Forest: montando o tuning da Random Forest

Abaixo montamos a nova configuração de modelo, com seus hiperparâmetros a serem tunados.

tune_spec_rand_forest <- rand_forest(
  mtry = tune(),
  trees = tune(),
  min_n = tune()
) %>% 
  set_engine(
    engine = "ranger"
  ) %>% 
  set_mode(
    "classification"
  )

tune_spec_rand_forest
## Random Forest Model Specification (classification)
## 
## Main Arguments:
##   mtry = tune()
##   trees = tune()
##   min_n = tune()
## 
## Computational engine: ranger

Fazendo o tuning da random forest

Vamos criar um primeiro grid para tentar otimizar o valor dos hiperparâmetros

rand_for_grid <- grid_regular(
  mtry(range = c(3,50)),
  trees(range = c(1,100)),
  min_n(),
  levels = 4
)

rand_for_grid
## # A tibble: 64 x 3
##     mtry trees min_n
##    <int> <int> <int>
##  1     3     1     2
##  2    18     1     2
##  3    34     1     2
##  4    50     1     2
##  5     3    34     2
##  6    18    34     2
##  7    34    34     2
##  8    50    34     2
##  9     3    67     2
## 10    18    67     2
## # ... with 54 more rows
wf_rand_for_tune_sample <- workflow() %>% 
  add_model(
    tune_spec_rand_forest
  ) %>% 
  add_recipe(receita_com_interacao_class)
all_cores <- parallel::detectCores(logical = FALSE)

library(doParallel)
cl <- makePSOCKcluster(all_cores)
registerDoParallel(cl)


res_rand_for <- wf_rand_for_tune_sample %>% 
  tune_grid(
    resamples = folds,
    grid = rand_for_grid,
    control = control_grid(allow_par = TRUE, verbose = TRUE)
  )  

stopCluster(cl)
res_rand_for <- read_rds("resultados/res_rand_for.rds" )


collect_metrics(res_rand_for) %>% arrange(desc(mean)) %>% 
  filter(
    .metric == "roc_auc"
  ) %>% 
  head(
    n = 10
  ) %>% 
  gt() %>% 
  fmt_percent(
    columns = vars(mean, std_err)
  ) 
mtry trees min_n .metric .estimator mean n std_err .config
18 100 2 roc_auc binary 88.03% 10 0.48% Preprocessor1_Model14
50 100 2 roc_auc binary 87.72% 10 0.49% Preprocessor1_Model16
34 100 2 roc_auc binary 87.67% 10 0.46% Preprocessor1_Model15
18 67 2 roc_auc binary 87.67% 10 0.47% Preprocessor1_Model10
18 100 14 roc_auc binary 87.60% 10 0.50% Preprocessor1_Model30
50 67 2 roc_auc binary 87.44% 10 0.49% Preprocessor1_Model12
34 100 14 roc_auc binary 87.43% 10 0.50% Preprocessor1_Model31
34 67 2 roc_auc binary 87.43% 10 0.44% Preprocessor1_Model11
18 67 14 roc_auc binary 87.41% 10 0.47% Preprocessor1_Model26
50 100 14 roc_auc binary 87.34% 10 0.43% Preprocessor1_Model32

Visualizando os primeiros resultados do tuning

plot_result_tune_ranger <- function(results){
  

  results %>% 
    map_df(
      .f = collect_metrics
    ) %>% 
    filter(
      .metric == "roc_auc"
    ) %>% 
    mutate(
      ranque = rank(mean)
    ) %>% 
    ggplot() +
    geom_tile(
      aes(
        x = mtry %>%  factor(),
        y = trees %>%  factor(),
        fill = ranque
      )
    ) +
    geom_shadowtext(
      aes(
        x = mtry %>% factor(),
        y = trees %>%  factor(),
        label = percent(mean, accuracy = .01),
      ),
      color = "white",
      bg.colour="black",
      size = 3,
    ) +
    scale_fill_gradient(low = "white", high = "darkgreen") +
    facet_wrap(
      ~min_n, 
      ncol = 1, 
      labeller = as_labeller( function(x){str_glue("min_n: {x}")}  )
    ) +
    theme_minimal() +
    theme(
      legend.position = "top"
    ) +
    labs(
      x = "mtry",
      y = "trees"
       
    ) 
}

plot_result_tune_ranger(list(res_rand_for))

Segunda rodada de otimização

rand_for_grid_optim <- grid_regular(
  mtry(range = c(4,45)),
  trees(range = c(150,350)),
  min_n(c(1,6)),
  levels = 4
)





all_cores <- parallel::detectCores(logical = FALSE)

library(doParallel)
cl <- makePSOCKcluster(all_cores)
registerDoParallel(cl)


res_rand_for_optim <- wf_rand_for_tune_sample %>% 
  tune_grid(
    resamples = folds,
    grid = rand_for_grid_optim,
    control = control_grid(allow_par = TRUE, verbose = TRUE)
  )  

stopCluster(cl)
res_rand_for_optim <- read_rds("resultados/res_rand_for_optim.rds" )


collect_metrics(res_rand_for_optim) %>% arrange(desc(mean)) %>% 
  filter(
    .metric == "roc_auc"
  ) %>% 
  head(
    n = 10
  ) %>% 
  gt() %>% 
  fmt_percent(
    columns = vars(mean, std_err)
  ) 
mtry trees min_n .metric .estimator mean n std_err .config
17 283 1 roc_auc binary 88.29% 10 0.46% Preprocessor1_Model10
17 350 2 roc_auc binary 88.19% 10 0.45% Preprocessor1_Model30
17 350 1 roc_auc binary 88.18% 10 0.44% Preprocessor1_Model14
17 216 2 roc_auc binary 88.18% 10 0.45% Preprocessor1_Model22
17 216 1 roc_auc binary 88.13% 10 0.47% Preprocessor1_Model06
17 350 4 roc_auc binary 88.12% 10 0.46% Preprocessor1_Model46
17 283 2 roc_auc binary 88.12% 10 0.44% Preprocessor1_Model26
17 350 6 roc_auc binary 88.10% 10 0.43% Preprocessor1_Model62
17 283 4 roc_auc binary 88.09% 10 0.46% Preprocessor1_Model42
31 283 4 roc_auc binary 88.07% 10 0.45% Preprocessor1_Model43
plot_result_tune_ranger(list(res_rand_for, res_rand_for_optim))

Terceira rodada de otimização

rand_for_grid_optim_2 <- grid_regular(
  mtry(range = c(10,35)),
  trees(range = c(350,500)),
  min_n(range = c(1, 3)),
  levels = c(mtry = 3, trees = 4, min_n = 3)
    
)



all_cores <- parallel::detectCores(logical = FALSE)

library(doParallel)
cl <- makePSOCKcluster(all_cores)
registerDoParallel(cl)


res_rand_for_optim_2 <- wf_rand_for_tune_sample %>% 
  tune_grid(
    resamples = folds,
    grid = rand_for_grid_optim_2,
    control = control_grid(allow_par = TRUE, verbose = TRUE)
  )  

stopCluster(cl)
res_rand_for_optim_2 <- read_rds("resultados/res_rand_for_optim_2.rds" )

collect_metrics(res_rand_for_optim_2) %>% arrange(desc(mean)) %>% 
  filter(
    .metric == "roc_auc"
  ) %>% 
  head(
    n = 10
  ) %>% 
  gt() %>% 
  fmt_percent(
    columns = vars(mean, std_err)
  ) 
mtry trees min_n .metric .estimator mean n std_err .config
10 500 1 roc_auc binary 88.31% 10 0.42% Preprocessor1_Model10
22 500 1 roc_auc binary 88.30% 10 0.42% Preprocessor1_Model11
10 500 2 roc_auc binary 88.26% 10 0.42% Preprocessor1_Model22
22 350 1 roc_auc binary 88.24% 10 0.44% Preprocessor1_Model02
10 450 1 roc_auc binary 88.24% 10 0.43% Preprocessor1_Model07
22 450 1 roc_auc binary 88.24% 10 0.47% Preprocessor1_Model08
22 400 1 roc_auc binary 88.24% 10 0.48% Preprocessor1_Model05
10 400 2 roc_auc binary 88.24% 10 0.43% Preprocessor1_Model16
22 500 3 roc_auc binary 88.22% 10 0.45% Preprocessor1_Model35
22 500 2 roc_auc binary 88.21% 10 0.47% Preprocessor1_Model23
plot_result_tune_ranger(list(res_rand_for, res_rand_for_optim, res_rand_for_optim_2 ))

Quarta rodada de otimização

rand_for_grid_optim_3 <- grid_regular(
  mtry(range = c(8, 22)),
  trees(range = c(400,500)),
  min_n(range = c(1,1)),
  levels = c(mtry = 5, trees = 5, min_n = 1)
    
)

rand_for_grid_optim_3
## # A tibble: 25 x 3
##     mtry trees min_n
##    <int> <int> <int>
##  1     8   400     1
##  2    11   400     1
##  3    15   400     1
##  4    18   400     1
##  5    22   400     1
##  6     8   425     1
##  7    11   425     1
##  8    15   425     1
##  9    18   425     1
## 10    22   425     1
## # ... with 15 more rows
res_rand_for_optim_3 <- read_rds("resultados/res_rand_for_optim_3.rds" )

collect_metrics(res_rand_for_optim_3) %>% arrange(desc(mean)) %>% 
  filter(
    .metric == "roc_auc"
  ) %>% 
  head(
    n = 10
  ) %>% 
  gt() %>% 
  fmt_percent(
    columns = vars(mean, std_err)
  ) 
mtry trees min_n .metric .estimator mean n std_err .config
22 475 1 roc_auc binary 88.25% 10 0.42% Preprocessor1_Model20
18 500 1 roc_auc binary 88.25% 10 0.46% Preprocessor1_Model24
15 450 1 roc_auc binary 88.24% 10 0.45% Preprocessor1_Model13
15 500 1 roc_auc binary 88.24% 10 0.45% Preprocessor1_Model23
18 475 1 roc_auc binary 88.23% 10 0.44% Preprocessor1_Model19
11 450 1 roc_auc binary 88.22% 10 0.42% Preprocessor1_Model12
15 425 1 roc_auc binary 88.22% 10 0.44% Preprocessor1_Model08
11 500 1 roc_auc binary 88.22% 10 0.44% Preprocessor1_Model22
18 425 1 roc_auc binary 88.22% 10 0.46% Preprocessor1_Model09
18 400 1 roc_auc binary 88.22% 10 0.45% Preprocessor1_Model04
plot_result_tune_ranger(
  list(
    res_rand_for,
    res_rand_for_optim,
    res_rand_for_optim_2,
    res_rand_for_optim_3
  )
)

Redes neurais

Apesar do hype que envolve as redes neurais, que faz elas parecerem mágicas e misteriosas, elas são métodos não lineares onde são aplicadas regressões lineares em cima de saída de outras regressões. Algumas dessas saídas são modificadas por funções de ativação não lineares, dando caráter não linear ao método. Os coeficientes dessas regressões, também chamados de pesos, são calibrados de forma a minimizar o erro através de um algoritmo inteligente chamado backpropagation.

Uma arquitetura de rede neural

Existem várias arquiteturas de redes neurais.

Vamos usar a arquitetura mais simples e mais usada: single hidden layer.

Como mostra a figura à esquerda, esse modelo tem uma camada de entrada, uma camada “escondida” intermediária e uma camada de saída.

\(p\) é o número de entradas, \(K\) o número de categorias de saída possíveis (no caso de classificação, no caso de regressão há uma saída) e \(M\), o número de neurônios na camada escondida.

O nome neurônio vem de uma simplificação de como funciona a célula. Assim como os nós da rede neural, a célula também recebe várias entradas e tem um processo de ativação (liga ou desliga) dependendo da intensidade da informação que rebebe nas entradas.

No entanto o cérebro humano é uma rede extremamente complexa com uma arquitetura que foi selecionada durante bilhões de anos.

Existem arquiteturas de rede neural mais complicadas, mas o número de pesos a serem calibrados é explosivo. Elas exigem uma quantidade massiva de dados. A área que estuda essa redes profundas se chama deep learning.

Um exemplo de rede neural com 3 entradas

Vamos fazer o mesmo exercício que já fizemos com os outros modelos, treinando ela para apenas dois features contínuos e um categórico.

set.seed(555)


receita_nnet_demo <- recipe(
  i_liked_partner ~
    i_found_partner__attractive +
    sex +
    my_age,
   data = dado_treino
) %>% 
  step_center(all_numeric()) %>%
  step_scale(all_numeric())
  

nnet_mod <- 
  mlp(
    hidden_units = 10,
    epochs = 1000
  ) %>% 
  set_engine("nnet") %>% 
  set_mode("classification")

wf_nnet_demo <- workflow() %>% 
  add_recipe(receita_nnet_demo) %>% 
  add_model(nnet_mod)

fit_nnet_demo <- 
  wf_nnet_demo %>% 
  fit(
    data = dado_treino
  )
  
fit_nnet_demo
## == Workflow [trained] ==========================================================
## Preprocessor: Recipe
## Model: mlp()
## 
## -- Preprocessor ----------------------------------------------------------------
## 2 Recipe Steps
## 
## * step_center()
## * step_scale()
## 
## -- Model -----------------------------------------------------------------------
## a 3-10-1 network with 51 weights
## inputs: i_found_partner__attractive sexHomem my_age 
## output(s): ..y 
## options were - entropy fitting

Podemos perceber que o modelo cria um padrão de classificação complexo, que consegue considerar a interação entre as entradas.

predicoes_nnet <- predict(
  object = fit_nnet_demo,
  new_data = dados_novos
)


predicoes_nnet_com_dados <- bind_cols(
  dados_novos,
  predicoes_nnet
) 



ggplot(predicoes_nnet_com_dados) +
  geom_tile(
    aes(
      x = i_found_partner__attractive,
      y = my_age,
      fill = .pred_class,
      alpha = 0.1
    )
  ) +
  geom_jitter(
    data = dado_treino,
    width = 0.5,
    aes(
      x = i_found_partner__attractive,
      y = my_age,
      color = i_liked_partner
    ),
    size = 0.8,
    show.legend = FALSE
  ) +
  facet_wrap(
    ~sex
  ) +
  guides(
    alpha = FALSE
  ) +
  theme_minimal() +
  scale_fill_manual(
    values = c(Liked = "darkgreen", Not = "darkred")
  ) +
  scale_color_manual(
    values = c(Liked = "darkgreen", Not = "darkred")
  ) +
  theme(
    legend.position = "top"
  ) +
  labs(
    fill = "",
    x = "Achei bonito",
    y = "Minha idade"
  )

Fazendo tuning na rede neural

hidden units é o número de neurônios na camada escondida. Quanto maior o número maior a flexibilidade da rede.

penalty é uma penalidade similar à da regressão do tipo ridge, também limita o número de pesos ativos.

epochs é o número de iterações usadas pra treinar a rede. Quanto mais iterações mais adaptada ao conjunto de entrada a rede estará.

nnet_grid <- grid_regular(
  hidden_units(),
  penalty(),
  epochs(),
  levels = 3
)

nnet_grid
## # A tibble: 27 x 3
##    hidden_units      penalty epochs
##           <int>        <dbl>  <int>
##  1            1 0.0000000001     10
##  2            5 0.0000000001     10
##  3           10 0.0000000001     10
##  4            1 0.00001          10
##  5            5 0.00001          10
##  6           10 0.00001          10
##  7            1 1                10
##  8            5 1                10
##  9           10 1                10
## 10            1 0.0000000001    505
## # ... with 17 more rows
receita_com_interacao_class <- receita_com_interacao_class %>% 
  step_center(all_numeric()) %>%
  step_scale(all_numeric())
  

tune_spec_nnet <- parsnip::mlp(
  
  hidden_units = tune(),
  penalty = tune(),
  epochs = tune()

) %>% 
  set_engine(
    engine = "nnet",
    MaxNWts = 3000
  ) %>% 
  set_mode(
    "classification"
  )


wf_nnet_tune_sample <- workflow() %>% 
  add_model(
    tune_spec_nnet
  ) %>% 
  add_recipe(receita_com_interacao_class_center_scale)


all_cores <- parallel::detectCores(logical = FALSE)

library(doParallel)
cl <- makePSOCKcluster(all_cores)
registerDoParallel(cl)


res_nnet <- wf_nnet_tune_sample %>% 
  tune_grid(
    resamples = folds,
    grid = nnet_grid,
    control = control_grid(allow_par = TRUE, verbose = TRUE)
  )  

stopCluster(cl)
 



write_rds(res_nnet, "resultados/res_nnet.rds")
res_nnet <- read_rds("resultados/res_nnet.rds")

collect_metrics(res_nnet) %>% arrange(desc(mean))
## # A tibble: 54 x 9
##    hidden_units   penalty epochs .metric .estimator  mean     n std_err .config 
##           <int>     <dbl>  <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>   
##  1            1  1.00e+ 0   1000 roc_auc binary     0.847    10 0.00281 Preproc~
##  2            1  1.00e+ 0    505 roc_auc binary     0.847    10 0.00251 Preproc~
##  3           10  1.00e+ 0     10 roc_auc binary     0.836    10 0.00390 Preproc~
##  4           10  1.00e+ 0    505 roc_auc binary     0.836    10 0.00354 Preproc~
##  5           10  1.00e-10     10 roc_auc binary     0.834    10 0.00352 Preproc~
##  6           10  1.00e+ 0   1000 roc_auc binary     0.831    10 0.00525 Preproc~
##  7            5  1.00e+ 0     10 roc_auc binary     0.828    10 0.00460 Preproc~
##  8            5  1.00e+ 0    505 roc_auc binary     0.828    10 0.00405 Preproc~
##  9           10  1.00e- 5     10 roc_auc binary     0.825    10 0.00558 Preproc~
## 10            1  1.00e+ 0     10 roc_auc binary     0.823    10 0.00557 Preproc~
## # ... with 44 more rows
plot_result_tune_nnet <- function(results){
  

  results %>% 
    map_df(
      .f = collect_metrics
    ) %>% 
    filter(
      .metric == "roc_auc"
    ) %>% 
    group_by(
      penalty, epochs, hidden_units
    ) %>% 
    summarise(
      mean = mean(mean)
    ) %>% 
    ungroup() %>% 
    mutate(
      ranque = rank(mean)
    ) %>% 
    ggplot() +
    geom_tile(
      aes(
        x = penalty ,
        y = epochs %>%  factor(),
        fill = ranque
      )
    ) +
    scale_x_log10() +
    geom_shadowtext(
      aes(
        x = penalty,
        y = epochs %>%  factor(),
        label = percent(mean, accuracy = .01),
      ),
      color = "white",
      bg.colour="black",
      size = 3
    ) +
    scale_fill_gradient(low = "white", high = "darkgreen") +
    facet_wrap(
      ~hidden_units, 
      ncol = 1,
      labeller = as_labeller( function(x){str_glue("hidden: {x}")}  )
    ) +
    theme_minimal() +
    theme(
      legend.position = "top"
    )
}


plot_result_tune_nnet(list(res_nnet))

Fazendo tuning na rede neural, segunda rodada

nnet_grid_2 <- grid_regular(
  hidden_units(range = c(1,10)),
  penalty( range = c(-1,1)),
  epochs(range = c(10, 1000)),
  levels = c(hidden_units = 3, penalty = 2, epochs = 3)
)

nnet_grid_2
## # A tibble: 18 x 3
##    hidden_units penalty epochs
##           <int>   <dbl>  <int>
##  1            1     0.1     10
##  2            5     0.1     10
##  3           10     0.1     10
##  4            1    10       10
##  5            5    10       10
##  6           10    10       10
##  7            1     0.1    505
##  8            5     0.1    505
##  9           10     0.1    505
## 10            1    10      505
## 11            5    10      505
## 12           10    10      505
## 13            1     0.1   1000
## 14            5     0.1   1000
## 15           10     0.1   1000
## 16            1    10     1000
## 17            5    10     1000
## 18           10    10     1000
all_cores <- parallel::detectCores(logical = FALSE)

library(doParallel)
cl <- makePSOCKcluster(all_cores)
registerDoParallel(cl)


res_nnet_2 <- wf_nnet_tune_sample %>% 
  tune_grid(
    resamples = folds,
    grid = nnet_grid_2,
    control = control_grid(allow_par = TRUE, verbose = TRUE)
  )  

stopCluster(cl)
 
write_rds(res_nnet_2, "resultados/res_nnet_2.rds")
res_nnet_2 <- read_rds("resultados/res_nnet_2.rds")


collect_metrics(res_nnet_2) %>% arrange(desc(mean))
## # A tibble: 36 x 9
##    hidden_units penalty epochs .metric .estimator  mean     n std_err .config   
##           <int>   <dbl>  <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>     
##  1           10    10     1000 roc_auc binary     0.859    10 0.00308 Preproces~
##  2           10    10      505 roc_auc binary     0.859    10 0.00317 Preproces~
##  3            5    10     1000 roc_auc binary     0.858    10 0.00302 Preproces~
##  4            5    10      505 roc_auc binary     0.858    10 0.00342 Preproces~
##  5            1    10      505 roc_auc binary     0.849    10 0.00320 Preproces~
##  6            1    10     1000 roc_auc binary     0.848    10 0.00314 Preproces~
##  7           10    10       10 roc_auc binary     0.842    10 0.00455 Preproces~
##  8            5    10       10 roc_auc binary     0.838    10 0.00274 Preproces~
##  9           10     0.1     10 roc_auc binary     0.837    10 0.00372 Preproces~
## 10            1     0.1    505 roc_auc binary     0.832    10 0.00159 Preproces~
## # ... with 26 more rows
plot_result_tune_nnet(list(res_nnet_2, res_nnet))

nnet_grid_3 <- grid_regular(
  hidden_units(range = c(1,15)),
  penalty( range = c(0,1)),
  epochs(range = c(1500, 2000)),
  levels = c(hidden_units = 4, penalty = 2, epochs = 2)
)





all_cores <- parallel::detectCores(logical = FALSE)

library(doParallel)
cl <- makePSOCKcluster(all_cores)
registerDoParallel(cl)


res_nnet_3 <- wf_nnet_tune_sample %>% 
  tune_grid(
    resamples = folds,
    grid = nnet_grid_3,
    control = control_grid(allow_par = TRUE, verbose = TRUE)
  )  

stopCluster(cl)
 
write_rds(res_nnet_3, "resultados/res_nnet_3.rds")
res_nnet_3 <- read_rds("resultados/res_nnet_3.rds")


collect_metrics(res_nnet_3) %>% arrange(desc(mean))
## # A tibble: 32 x 9
##    hidden_units penalty epochs .metric .estimator  mean     n std_err .config   
##           <int>   <dbl>  <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>     
##  1            5      10   2000 roc_auc binary     0.859    10 0.00288 Preproces~
##  2           10      10   2000 roc_auc binary     0.859    10 0.00300 Preproces~
##  3           10      10   1500 roc_auc binary     0.859    10 0.00339 Preproces~
##  4           15      10   2000 roc_auc binary     0.859    10 0.00308 Preproces~
##  5           15      10   1500 roc_auc binary     0.858    10 0.00318 Preproces~
##  6            5      10   1500 roc_auc binary     0.858    10 0.00347 Preproces~
##  7            1      10   1500 roc_auc binary     0.849    10 0.00314 Preproces~
##  8            1      10   2000 roc_auc binary     0.848    10 0.00318 Preproces~
##  9           15       1   2000 roc_auc binary     0.847    10 0.00304 Preproces~
## 10            1       1   1500 roc_auc binary     0.846    10 0.00284 Preproces~
## # ... with 22 more rows
plot_result_tune_nnet(list(res_nnet_2, res_nnet, res_nnet_3))

nnet_grid_4 <- grid_regular(  
  hidden_units(range = c(10,15)),
  penalty( range = c(1,2)),
  epochs(range = c(2500, 3500)),
  levels = c(hidden_units = 2, penalty = 2, epochs = 3)
)



all_cores <- parallel::detectCores(logical = FALSE)

library(doParallel)
cl <- makePSOCKcluster(all_cores)
registerDoParallel(cl)


res_nnet_4 <- wf_nnet_tune_sample %>% 
  tune_grid(
    resamples = folds,
    grid = nnet_grid_4,
    control = control_grid(allow_par = TRUE, verbose = TRUE)
  )  

stopCluster(cl)
 
write_rds(res_nnet_4, "resultados/res_nnet_4.rds")
res_nnet_4 <- read_rds("resultados/res_nnet_4.rds")


collect_metrics(res_nnet_4) %>% arrange(desc(mean))
## # A tibble: 24 x 9
##    hidden_units penalty epochs .metric .estimator  mean     n std_err .config   
##           <int>   <dbl>  <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>     
##  1           10      10   2500 roc_auc binary     0.859    10 0.00328 Preproces~
##  2           15      10   2500 roc_auc binary     0.859    10 0.00323 Preproces~
##  3           10      10   3000 roc_auc binary     0.858    10 0.00319 Preproces~
##  4           15      10   3500 roc_auc binary     0.858    10 0.00306 Preproces~
##  5           15      10   3000 roc_auc binary     0.858    10 0.00314 Preproces~
##  6           10      10   3500 roc_auc binary     0.858    10 0.00308 Preproces~
##  7           10     100   3000 roc_auc binary     0.834    10 0.00428 Preproces~
##  8           10     100   2500 roc_auc binary     0.834    10 0.00428 Preproces~
##  9           10     100   3500 roc_auc binary     0.834    10 0.00427 Preproces~
## 10           15     100   2500 roc_auc binary     0.834    10 0.00426 Preproces~
## # ... with 14 more rows
plot_result_tune_nnet(list(res_nnet_2, res_nnet, res_nnet_3, res_nnet_4))

Validando o modelo: selecionando o melhor modelo

Já experimentamos alguns modelos e otimizamos seus hiperparâmetros, testando-os no conjunto de validação, sempre fora do conjunto de treinamento.

O nosso melhor modelo foi uma das configurações de random forest.

Selecionamos esse modelo com select_best()

best_ranger <- 
  select_best(
    res_rand_for_optim_2  
  )

best_ranger
## # A tibble: 1 x 4
##    mtry trees min_n .config              
##   <int> <int> <int> <chr>                
## 1    10   500     1 Preprocessor1_Model10

Então criamos um workflow baseado no workflow que estávamos usando mas adicionando a informação de que este é o modelo escolhido

final_wf_ranger <- wf_rand_for_tune_sample %>% 
  finalize_workflow(best_ranger)

final_wf_ranger
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: rand_forest()
## 
## -- Preprocessor ----------------------------------------------------------------
## 22 Recipe Steps
## 
## * step_ordinalscore()
## * step_dummy()
## * step_mutate()
## * step_poly()
## * step_interact()
## * step_interact()
## * step_interact()
## * step_interact()
## * step_interact()
## * step_interact()
## * ...
## * and 12 more steps.
## 
## -- Model -----------------------------------------------------------------------
## Random Forest Model Specification (classification)
## 
## Main Arguments:
##   mtry = 10
##   trees = 500
##   min_n = 1
## 
## Computational engine: ranger

Validando o modelo: treinamento final

Treinando o modelo com todos os dados de treinamento/validação

final_ranger <- 
  final_wf_ranger %>%
  fit(data = dado_treino) 


final_ranger
## == Workflow [trained] ==========================================================
## Preprocessor: Recipe
## Model: rand_forest()
## 
## -- Preprocessor ----------------------------------------------------------------
## 22 Recipe Steps
## 
## * step_ordinalscore()
## * step_dummy()
## * step_mutate()
## * step_poly()
## * step_interact()
## * step_interact()
## * step_interact()
## * step_interact()
## * step_interact()
## * step_interact()
## * ...
## * and 12 more steps.
## 
## -- Model -----------------------------------------------------------------------
## Ranger result
## 
## Call:
##  ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~10L,      x), num.trees = ~500L, min.node.size = min_rows(~1L, x),      num.threads = 1, verbose = FALSE, seed = sample.int(10^5,          1), probability = TRUE) 
## 
## Type:                             Probability estimation 
## Number of trees:                  500 
## Sample size:                      3664 
## Number of independent variables:  164 
## Mtry:                             10 
## Target node size:                 1 
## Variable importance mode:         none 
## Splitrule:                        gini 
## OOB prediction error (Brier s.):  0.1446971

Validando o modelo: avaliando fora da amostra

Vamos avaliar o modelo fora da amostra usada pra validação.

all_cores <- parallel::detectCores(logical = FALSE)


cl <- makePSOCKcluster(all_cores)
registerDoParallel(cl)


final_fit_ranger <- 
  final_wf_ranger %>% 
  last_fit(split_dado)

stopCluster(cl)

write_rds(final_fit_ranger, "resultados/final_fit_ranger.rds")

Podemos perceber que o modelo conseguiu generalizar os resultados para os dados de teste. Uma boa notícia.

final_fit_ranger <- read_rds("resultados/final_fit_ranger.rds")


final_fit_ranger %>% collect_metrics() %>% 
  gt() %>% 
  fmt_percent(
    columns = vars(.estimate)
  )
.metric .estimator .estimate .config
accuracy binary 80.84% Preprocessor1_Model1
roc_auc binary 89.26% Preprocessor1_Model1

Performance nos dados de teste: curva ROC

curva <- final_fit_ranger %>% 
  collect_predictions() %>% 
  roc_curve(
    truth = i_liked_partner,
    .pred_Liked
  )

ponto_gatilho <- curva %>% 
  filter(
    .threshold > 0.45 
  ) %>% 
  slice_min(
    n = 1, order_by = .threshold
  )
  

  
curva %>% ggplot(aes(x = 1 - specificity, y = sensitivity)) +
  geom_path() +
  geom_point(
    data = ponto_gatilho,
    aes(x = 1 - specificity, y = sensitivity),
    size = 3,
    color = "darkblue"
  ) +
  geom_text_repel(
    data = ponto_gatilho,
    aes(
      x = 1 - specificity + 0.15, 
      y = sensitivity - 0.15,
      label = str_glue("{sensitivity %>% percent(accuracy = 0.1)}/{(1 - specificity) %>%  percent(accuracy = 0.1)}")
    )
  ) +
  geom_abline(lty = 3) +
  coord_equal() +
  theme_bw()

Séries temporais: resumo muito sucinto

As séries temporais possuem propriedades impõem desafios ao uso do arcabouço que vimos anteriormente. Estas propriedades levaram à criação de modelos específicos: os modelos autorregressivos do tipo ARIMA.

A principal delas é o fato de que as observações das séries temporais frequentemente não são independentes, ou seja \(x_t\) depende de \(x_{t-1}, x_{t-2}, ...\). Isso acontece para a maioria das séries que interessam pra nós.

É muito comum, também, lidarmos com séries não-estacionárias.

Existem alguns tipos de estacionariedade, mas, grosso modo, a série estacionária é aquela que mantém sua distribuição ao longo do tempo, ou seja, a distribuição de \(x_t\), portanto média e variância também, não depende de \(t\).

Lidamos com muitas séries onde a cada \(t\) as informações que chegam são incorporadas à série como alterações (relativas e não absolutas) no patamar que a série tinha em \(t-1\), como é o caso de preços de ativos (não derivativos). Estas séries temporais flutuam ao redor de patamares diferentes ao longo do tempo.

Tidyverts: tidyverse das time series

Tidyverts é um conjunto de bibliotecas tidy friendly para lidar com séries temporais.

É possível obter mais informações em tidyverts.org.

Além da documentação exposta neste site o autor disponibiliza um livro sobre séries temporais, que não é tão aprofundado quanto os livros clássicos do tema, como os do Box e do Hamilton, mas apresenta os conceitos adequadamente, além de estar disponível online e usar a tidyverts nos exemplos.

Forecasting: Principles and Practice

Criando um tsibble

O código abaixo cria um tsibble, a estrutura usada na tidyverts para representar uma série temporal.

Vale a pena notar:

crime_rio <- read_csv2(
  "dados/crime_rio/BaseDPEvolucaoMensalCisp.csv",
  locale = locale(encoding = "latin1")
) 

crime_estado <- crime_rio %>% 
  select(
    mes_ano,
    hom_doloso, lesao_corp_morte, latrocinio, hom_por_interv_policial, tentat_hom, lesao_corp_dolosa, estupro, hom_culposo, lesao_corp_culposa, roubo_comercio, roubo_residencia, roubo_veiculo, roubo_carga, roubo_transeunte, roubo_em_coletivo, roubo_banco, roubo_cx_eletronico, roubo_celular, roubo_conducao_saque, roubo_apos_saque, roubo_bicicleta, outros_roubos, total_roubos, furto_veiculos, furto_transeunte, furto_coletivo, furto_celular, furto_bicicleta, outros_furtos, total_furtos, sequestro, extorsao, sequestro_relampago, estelionato, apreensao_drogas, posse_drogas, trafico_drogas, apreensao_drogas_sem_autor, recuperacao_veiculos, apf, aaapai, cmp, cmba, ameaca, pessoas_desaparecidas, encontro_cadaver, encontro_ossada, pol_militares_mortos_serv, pol_civis_mortos_serv, indicador_letalidade, indicador_roubo_rua, indicador_roubo_veic, registro_ocorrencias, fase   
  ) %>% 
  group_by(
    mes_ano
  ) %>% 
  summarise(
    across(
      .cols = everything(),
      .fns = ~sum(.x, na.rm = TRUE)
    )
  ) %>% 
  ungroup() %>% 
  separate(
    col = mes_ano,
    sep = "m",
    into = c("ano", "mes")
  ) %>% 
  mutate(
    data = make_date(ano, mes, 1) %>% yearmonth()
  ) %>% 
  arrange(data) %>% 
  mutate(
    across(
      .cols = -c(ano, mes, data) ,
      .fns = list(diff_log = function(x){c(NA,x %>% log() %>% diff())}  )
    )
  ) %>% 
  mutate(
    across(
      .cols = ends_with("diff_log"),
      .fns = ~if_else(is.na(.x), 0, .x)
    )
  ) %>% 
  relocate(
    data,
    hom_doloso,
    hom_doloso_diff_log
  ) 

crime_estado_ts <- as_tsibble(crime_estado, index = data)

O tsibble é criado e entende que tem um dado a cada mês

crime_estado_ts
## # A tsibble: 196 x 111 [1M]
##        data hom_doloso hom_doloso_diff~ ano   mes   lesao_corp_morte latrocinio
##       <mth>      <dbl>            <dbl> <chr> <chr>            <dbl>      <dbl>
##  1 2003 jan        589           0      2003  1                    6         20
##  2 2003 fev        583          -0.0102 2003  2                    4         12
##  3 2003 mar        613           0.0502 2003  3                    5         16
##  4 2003 abr        585          -0.0468 2003  4                   11         14
##  5 2003 mai        599           0.0236 2003  5                    6         20
##  6 2003 jun        546          -0.0926 2003  6                    2         14
##  7 2003 jul        527          -0.0354 2003  7                    8         17
##  8 2003 ago        500          -0.0526 2003  8                    4         10
##  9 2003 set        512           0.0237 2003  9                    4         16
## 10 2003 out        489          -0.0460 2003  10                   1         16
## # ... with 186 more rows, and 104 more variables:
## #   hom_por_interv_policial <dbl>, tentat_hom <dbl>, lesao_corp_dolosa <dbl>,
## #   estupro <dbl>, hom_culposo <dbl>, lesao_corp_culposa <dbl>,
## #   roubo_comercio <dbl>, roubo_residencia <dbl>, roubo_veiculo <dbl>,
## #   roubo_carga <dbl>, roubo_transeunte <dbl>, roubo_em_coletivo <dbl>,
## #   roubo_banco <dbl>, roubo_cx_eletronico <dbl>, roubo_celular <dbl>,
## #   roubo_conducao_saque <dbl>, roubo_apos_saque <dbl>, roubo_bicicleta <int>,
## #   outros_roubos <dbl>, total_roubos <dbl>, furto_veiculos <dbl>,
## #   furto_transeunte <dbl>, furto_coletivo <dbl>, furto_celular <dbl>,
## #   furto_bicicleta <int>, outros_furtos <dbl>, total_furtos <dbl>,
## #   sequestro <dbl>, extorsao <dbl>, sequestro_relampago <dbl>,
## #   estelionato <dbl>, apreensao_drogas <dbl>, posse_drogas <int>,
## #   trafico_drogas <int>, apreensao_drogas_sem_autor <int>,
## #   recuperacao_veiculos <dbl>, apf <int>, aaapai <int>, cmp <int>, cmba <int>,
## #   ameaca <dbl>, pessoas_desaparecidas <dbl>, encontro_cadaver <dbl>,
## #   encontro_ossada <dbl>, pol_militares_mortos_serv <dbl>,
## #   pol_civis_mortos_serv <dbl>, indicador_letalidade <dbl>,
## #   indicador_roubo_rua <dbl>, indicador_roubo_veic <dbl>,
## #   registro_ocorrencias <dbl>, fase <dbl>, lesao_corp_morte_diff_log <dbl>,
## #   latrocinio_diff_log <dbl>, hom_por_interv_policial_diff_log <dbl>,
## #   tentat_hom_diff_log <dbl>, lesao_corp_dolosa_diff_log <dbl>,
## #   estupro_diff_log <dbl>, hom_culposo_diff_log <dbl>,
## #   lesao_corp_culposa_diff_log <dbl>, roubo_comercio_diff_log <dbl>,
## #   roubo_residencia_diff_log <dbl>, roubo_veiculo_diff_log <dbl>,
## #   roubo_carga_diff_log <dbl>, roubo_transeunte_diff_log <dbl>,
## #   roubo_em_coletivo_diff_log <dbl>, roubo_banco_diff_log <dbl>,
## #   roubo_cx_eletronico_diff_log <dbl>, roubo_celular_diff_log <dbl>,
## #   roubo_conducao_saque_diff_log <dbl>, roubo_apos_saque_diff_log <dbl>,
## #   roubo_bicicleta_diff_log <dbl>, outros_roubos_diff_log <dbl>,
## #   total_roubos_diff_log <dbl>, furto_veiculos_diff_log <dbl>,
## #   furto_transeunte_diff_log <dbl>, furto_coletivo_diff_log <dbl>,
## #   furto_celular_diff_log <dbl>, furto_bicicleta_diff_log <dbl>,
## #   outros_furtos_diff_log <dbl>, total_furtos_diff_log <dbl>,
## #   sequestro_diff_log <dbl>, extorsao_diff_log <dbl>,
## #   sequestro_relampago_diff_log <dbl>, estelionato_diff_log <dbl>,
## #   apreensao_drogas_diff_log <dbl>, posse_drogas_diff_log <dbl>,
## #   trafico_drogas_diff_log <dbl>, apreensao_drogas_sem_autor_diff_log <dbl>,
## #   recuperacao_veiculos_diff_log <dbl>, apf_diff_log <dbl>,
## #   aaapai_diff_log <dbl>, cmp_diff_log <dbl>, cmba_diff_log <dbl>,
## #   ameaca_diff_log <dbl>, pessoas_desaparecidas_diff_log <dbl>,
## #   encontro_cadaver_diff_log <dbl>, encontro_ossada_diff_log <dbl>,
## #   pol_militares_mortos_serv_diff_log <dbl>,
## #   pol_civis_mortos_serv_diff_log <dbl>, indicador_letalidade_diff_log <dbl>,
## #   ...

Várias séries no mesmo tsibble

Na série anterior, somamos a cada mês os crimes do estado todo. No código abaixo, criamos um tsibble que contém as séries de cada delegacia.

Isso é feito usando o parâmetro key.

crime_delegacia <- crime_rio %>% 
  select(
    CISP, AISP,  RISP, munic, Regiao,
    mes_ano,
    hom_doloso, lesao_corp_morte, latrocinio, hom_por_interv_policial, tentat_hom, lesao_corp_dolosa, estupro, hom_culposo, lesao_corp_culposa, roubo_comercio, roubo_residencia, roubo_veiculo, roubo_carga, roubo_transeunte, roubo_em_coletivo, roubo_banco, roubo_cx_eletronico, roubo_celular, roubo_conducao_saque, roubo_apos_saque, roubo_bicicleta, outros_roubos, total_roubos, furto_veiculos, furto_transeunte, furto_coletivo, furto_celular, furto_bicicleta, outros_furtos, total_furtos, sequestro, extorsao, sequestro_relampago, estelionato, apreensao_drogas, posse_drogas, trafico_drogas, apreensao_drogas_sem_autor, recuperacao_veiculos, apf, aaapai, cmp, cmba, ameaca, pessoas_desaparecidas, encontro_cadaver, encontro_ossada, pol_militares_mortos_serv, pol_civis_mortos_serv, indicador_letalidade, indicador_roubo_rua, indicador_roubo_veic, registro_ocorrencias, fase   
  ) %>% 
  separate(
    col = mes_ano,
    sep = "m",
    into = c("ano", "mes")
  ) %>% 
  mutate(
    data = make_date(ano, mes, 1) %>% yearmonth()
  ) 

crime_delegacias_ts <- as_tsibble(
  x = crime_delegacia,
  index = data,
  key = c(CISP, AISP,  RISP, munic, Regiao)
) 

crime_delegacias_ts 
## # A tsibble: 26,217 x 62 [1M]
## # Key:       CISP, AISP, RISP, munic, Regiao [188]
##     CISP  AISP  RISP munic Regiao ano   mes   hom_doloso lesao_corp_morte
##    <dbl> <dbl> <dbl> <chr> <chr>  <chr> <chr>      <dbl>            <dbl>
##  1     1     5     1 Rio ~ Capit~ 2003  1              0                0
##  2     1     5     1 Rio ~ Capit~ 2003  2              0                0
##  3     1     5     1 Rio ~ Capit~ 2003  3              0                0
##  4     1     5     1 Rio ~ Capit~ 2003  4              0                0
##  5     1     5     1 Rio ~ Capit~ 2003  5              0                0
##  6     1     5     1 Rio ~ Capit~ 2003  6              0                0
##  7     1     5     1 Rio ~ Capit~ 2003  7              0                0
##  8     1     5     1 Rio ~ Capit~ 2003  8              0                0
##  9     1     5     1 Rio ~ Capit~ 2003  9              1                0
## 10     1     5     1 Rio ~ Capit~ 2003  10             0                0
## # ... with 26,207 more rows, and 53 more variables: latrocinio <dbl>,
## #   hom_por_interv_policial <dbl>, tentat_hom <dbl>, lesao_corp_dolosa <dbl>,
## #   estupro <dbl>, hom_culposo <dbl>, lesao_corp_culposa <dbl>,
## #   roubo_comercio <dbl>, roubo_residencia <dbl>, roubo_veiculo <dbl>,
## #   roubo_carga <dbl>, roubo_transeunte <dbl>, roubo_em_coletivo <dbl>,
## #   roubo_banco <dbl>, roubo_cx_eletronico <dbl>, roubo_celular <dbl>,
## #   roubo_conducao_saque <dbl>, roubo_apos_saque <dbl>, roubo_bicicleta <lgl>,
## #   outros_roubos <dbl>, total_roubos <dbl>, furto_veiculos <dbl>,
## #   furto_transeunte <dbl>, furto_coletivo <dbl>, furto_celular <dbl>,
## #   furto_bicicleta <lgl>, outros_furtos <dbl>, total_furtos <dbl>,
## #   sequestro <dbl>, extorsao <dbl>, sequestro_relampago <dbl>,
## #   estelionato <dbl>, apreensao_drogas <dbl>, posse_drogas <lgl>,
## #   trafico_drogas <lgl>, apreensao_drogas_sem_autor <lgl>,
## #   recuperacao_veiculos <dbl>, apf <lgl>, aaapai <lgl>, cmp <lgl>, cmba <lgl>,
## #   ameaca <dbl>, pessoas_desaparecidas <dbl>, encontro_cadaver <dbl>,
## #   encontro_ossada <dbl>, pol_militares_mortos_serv <dbl>,
## #   pol_civis_mortos_serv <dbl>, indicador_letalidade <dbl>,
## #   indicador_roubo_rua <dbl>, indicador_roubo_veic <dbl>,
## #   registro_ocorrencias <dbl>, fase <dbl>, data <mth>

Observando algumas características da série

feasts é uma biblioteca que ajuda a extrair características de uma série temporal.

Abaixo vemos os três gráficos que ela gera com a função gg_tsdisplay():

conjunto <- gg_tsdisplay(
  crime_estado_ts, 
  y = hom_doloso
) 


embeleza_gg_tsdisplay <- function(obj){
  
  
  
  obj[[1]] <- obj[[1]] + 
    theme_minimal() +
    scale_x_yearmonth(
      date_breaks = "1 year"
    ) +
    theme(
      axis.text.x = element_text(angle = 90)
    )
  
  obj[[2]] <- obj[[2]] + 
    theme_minimal() +
    scale_y_continuous(
      limits = c(-1,1)
    )
  
  
  obj[[3]] <- obj[[3]] + 
    theme_minimal() +
    scale_color_gradient(
      low = "lightblue",
      high = "darkblue",
      labels = function(x){x+2003},
      breaks = seq(0, 20, by = 2)
    ) + 
    theme(
      axis.text.x = element_text(angle = 90)
    )

  obj
    
}


embeleza_gg_tsdisplay(conjunto)

Testando se a série é estacionária

Existem testes que verificam a hipótese de que a série é estacionária.

No teste executado abaixo, Kwiatkowski, a hipótese nula é de que a série é não-estacionária (na verdade não-estacionária com a posibilidade de tendência constante).

Sendo assim p-value obtido mostra que a chance de se obter um resultado tão extremo quanto o observado se a hipótese nula for verdadeira é baixíssima e podemos rejeitar a hipótese nula.

crime_estado_ts %>% 
  features(log(hom_doloso), unitroot_kpss )
## # A tibble: 1 x 2
##   kpss_stat kpss_pvalue
##       <dbl>       <dbl>
## 1      2.25        0.01

Primeiras diferenças do log da série

Uma transformação comum nas séries temporais que muitas vezes deriva uma série estacionária a partir de uma série não-estacionária é aplicar a diferença no log da série. Ou seja, a série \(y\) obtida a partir da série \(x\) é \(y_t = log(x_t) - log(x_{t-1})\).

A transformação faz sentido, pois:

Uma característica interessante do log na base natural é que \(log(a) - log(b)\) se aproxima de \(\frac{a}{b}-1\), que é o retorno percentual. Esta aproximação funciona melhor quanto este número é pequeno.

Essa característica facilita a interpretação de alguns resultados.

crime_estado_ts %>% 
  mutate(
    diff_perc = hom_doloso/lag(hom_doloso) - 1
  ) %>% 
  select(
    data,
    hom_doloso,
    hom_doloso_diff_log,
    diff_perc
  )  %>% 
  gt() %>% 
  fmt_percent(
    columns = vars(hom_doloso_diff_log, diff_perc)
    
  )
data hom_doloso hom_doloso_diff_log diff_perc
2003 jan 589 0.00% NA
2003 fev 583 −1.02% −1.02%
2003 mar 613 5.02% 5.15%
2003 abr 585 −4.68% −4.57%
2003 mai 599 2.36% 2.39%
2003 jun 546 −9.26% −8.85%
2003 jul 527 −3.54% −3.48%
2003 ago 500 −5.26% −5.12%
2003 set 512 2.37% 2.40%
2003 out 489 −4.60% −4.49%
2003 nov 528 7.67% 7.98%
2003 dez 553 4.63% 4.73%
2004 jan 578 4.42% 4.52%
2004 fev 540 −6.80% −6.57%
2004 mar 529 −2.06% −2.04%
2004 abr 514 −2.88% −2.84%
2004 mai 605 16.30% 17.70%
2004 jun 502 −18.66% −17.02%
2004 jul 505 0.60% 0.60%
2004 ago 521 3.12% 3.17%
2004 set 507 −2.72% −2.69%
2004 out 522 2.92% 2.96%
2004 nov 570 8.80% 9.20%
2004 dez 545 −4.49% −4.39%
2005 jan 607 10.77% 11.38%
2005 fev 619 1.96% 1.98%
2005 mar 682 9.69% 10.18%
2005 abr 526 −25.97% −22.87%
2005 mai 561 6.44% 6.65%
2005 jun 488 −13.94% −13.01%
2005 jul 580 17.27% 18.85%
2005 ago 562 −3.15% −3.10%
2005 set 533 −5.30% −5.16%
2005 out 503 −5.79% −5.63%
2005 nov 495 −1.60% −1.59%
2005 dez 464 −6.47% −6.26%
2006 jan 480 3.39% 3.45%
2006 fev 521 8.20% 8.54%
2006 mar 607 15.28% 16.51%
2006 abr 579 −4.72% −4.61%
2006 mai 548 −5.50% −5.35%
2006 jun 475 −14.30% −13.32%
2006 jul 478 0.63% 0.63%
2006 ago 471 −1.48% −1.46%
2006 set 521 10.09% 10.62%
2006 out 552 5.78% 5.95%
2006 nov 527 −4.63% −4.53%
2006 dez 564 6.79% 7.02%
2007 jan 526 −6.98% −6.74%
2007 fev 486 −7.91% −7.60%
2007 mar 640 27.53% 31.69%
2007 abr 572 −11.23% −10.62%
2007 mai 466 −20.50% −18.53%
2007 jun 445 −4.61% −4.51%
2007 jul 457 2.66% 2.70%
2007 ago 524 13.68% 14.66%
2007 set 447 −15.89% −14.69%
2007 out 486 8.37% 8.72%
2007 nov 528 8.29% 8.64%
2007 dez 556 5.17% 5.30%
2008 jan 538 −3.29% −3.24%
2008 fev 505 −6.33% −6.13%
2008 mar 527 4.26% 4.36%
2008 abr 475 −10.39% −9.87%
2008 mai 412 −14.23% −13.26%
2008 jun 402 −2.46% −2.43%
2008 jul 413 2.70% 2.74%
2008 ago 430 4.03% 4.12%
2008 set 435 1.16% 1.16%
2008 out 557 24.72% 28.05%
2008 nov 516 −7.65% −7.36%
2008 dez 507 −1.76% −1.74%
2009 jan 551 8.32% 8.68%
2009 fev 556 0.90% 0.91%
2009 mar 588 5.60% 5.76%
2009 abr 542 −8.15% −7.82%
2009 mai 522 −3.76% −3.69%
2009 jun 439 −17.32% −15.90%
2009 jul 397 −10.06% −9.57%
2009 ago 432 8.45% 8.82%
2009 set 433 0.23% 0.23%
2009 out 419 −3.29% −3.23%
2009 nov 438 4.43% 4.53%
2009 dez 476 8.32% 8.68%
2010 jan 447 −6.29% −6.09%
2010 fev 473 5.65% 5.82%
2010 mar 492 3.94% 4.02%
2010 abr 432 −13.01% −12.20%
2010 mai 361 −17.95% −16.44%
2010 jun 347 −3.96% −3.88%
2010 jul 324 −6.86% −6.63%
2010 ago 344 5.99% 6.17%
2010 set 360 4.55% 4.65%
2010 out 406 12.02% 12.78%
2010 nov 364 −10.92% −10.34%
2010 dez 417 13.59% 14.56%
2011 jan 425 1.90% 1.92%
2011 fev 368 −14.40% −13.41%
2011 mar 381 3.47% 3.53%
2011 abr 403 5.61% 5.77%
2011 mai 368 −9.09% −8.68%
2011 jun 307 −18.12% −16.58%
2011 jul 331 7.53% 7.82%
2011 ago 371 11.41% 12.08%
2011 set 323 −13.85% −12.94%
2011 out 318 −1.56% −1.55%
2011 nov 339 6.39% 6.60%
2011 dez 345 1.75% 1.77%
2012 jan 329 −4.75% −4.64%
2012 fev 395 18.28% 20.06%
2012 mar 394 −0.25% −0.25%
2012 abr 342 −14.15% −13.20%
2012 mai 346 1.16% 1.17%
2012 jun 318 −8.44% −8.09%
2012 jul 298 −6.50% −6.29%
2012 ago 294 −1.35% −1.34%
2012 set 331 11.85% 12.59%
2012 out 314 −5.27% −5.14%
2012 nov 325 3.44% 3.50%
2012 dez 395 19.51% 21.54%
2013 jan 397 0.51% 0.51%
2013 fev 389 −2.04% −2.02%
2013 mar 411 5.50% 5.66%
2013 abr 417 1.45% 1.46%
2013 mai 430 3.07% 3.12%
2013 jun 362 −17.21% −15.81%
2013 jul 302 −18.12% −16.57%
2013 ago 407 29.84% 34.77%
2013 set 378 −7.39% −7.13%
2013 out 377 −0.26% −0.26%
2013 nov 414 9.36% 9.81%
2013 dez 461 10.75% 11.35%
2014 jan 464 0.65% 0.65%
2014 fev 482 3.81% 3.88%
2014 mar 510 5.65% 5.81%
2014 abr 449 −12.74% −11.96%
2014 mai 444 −1.12% −1.11%
2014 jun 377 −16.36% −15.09%
2014 jul 370 −1.87% −1.86%
2014 ago 373 0.81% 0.81%
2014 set 345 −7.80% −7.51%
2014 out 375 8.34% 8.70%
2014 nov 345 −8.34% −8.00%
2014 dez 408 16.77% 18.26%
2015 jan 439 7.32% 7.60%
2015 fev 326 −29.76% −25.74%
2015 mar 382 15.85% 17.18%
2015 abr 339 −11.94% −11.26%
2015 mai 347 2.33% 2.36%
2015 jun 272 −24.35% −21.61%
2015 jul 306 11.78% 12.50%
2015 ago 336 9.35% 9.80%
2015 set 351 4.37% 4.46%
2015 out 380 7.94% 8.26%
2015 nov 340 −11.12% −10.53%
2015 dez 382 11.65% 12.35%
2016 jan 406 6.09% 6.28%
2016 fev 404 −0.49% −0.49%
2016 mar 445 9.67% 10.15%
2016 abr 475 6.52% 6.74%
2016 mai 369 −25.25% −22.32%
2016 jun 373 1.08% 1.08%
2016 jul 368 −1.35% −1.34%
2016 ago 387 5.03% 5.16%
2016 set 425 9.37% 9.82%
2016 out 465 8.99% 9.41%
2016 nov 461 −0.86% −0.86%
2016 dez 464 0.65% 0.65%
2017 jan 479 3.18% 3.23%
2017 fev 503 4.89% 5.01%
2017 mar 498 −1.00% −0.99%
2017 abr 436 −13.30% −12.45%
2017 mai 425 −2.56% −2.52%
2017 jun 390 −8.59% −8.24%
2017 jul 374 −4.19% −4.10%
2017 ago 397 5.97% 6.15%
2017 set 458 14.29% 15.37%
2017 out 486 5.93% 6.11%
2017 nov 455 −6.59% −6.38%
2017 dez 445 −2.22% −2.20%
2018 jan 468 5.04% 5.17%
2018 fev 441 −5.94% −5.77%
2018 mar 508 14.14% 15.19%
2018 abr 476 −6.51% −6.30%
2018 mai 423 −11.80% −11.13%
2018 jun 375 −12.04% −11.35%
2018 jul 410 8.92% 9.33%
2018 ago 360 −13.01% −12.20%
2018 set 382 5.93% 6.11%
2018 out 383 0.26% 0.26%
2018 nov 378 −1.31% −1.31%
2018 dez 346 −8.85% −8.47%
2019 jan 386 10.94% 11.56%
2019 fev 319 −19.06% −17.36%
2019 mar 344 7.55% 7.84%
2019 abr 356 3.43% 3.49%

Observando as características de diff(log(serie))

A série de diferenças do log apresenta um média e variância mais constante visualmente.

As autocorrelações ainda estão presentes, mas têm magnitude bem menor. É possível notar uma autocorrelação com o lag 12, devida à sazonalidade.

serie_diff_log <- gg_tsdisplay(crime_estado_ts, y = hom_doloso_diff_log )

embeleza_gg_tsdisplay(serie_diff_log)

Testando a estacionariedade de diff(log(serie))

O teste dessa vez não oferece evidências para descartarmos a hipótese nula de que a aérie é não-estacionária.

crime_estado_ts %>% 
  features(hom_doloso_diff_log, unitroot_kpss )
## # A tibble: 1 x 2
##   kpss_stat kpss_pvalue
##       <dbl>       <dbl>
## 1    0.0212         0.1

Uma decomposição da série

A feasts oferece uma forma de decompor a série temporal em tendência, sazonalidade e ruído.

dcmp <- crime_estado_ts %>%
  model(STL(hom_doloso ~ season(window = Inf)))

components(dcmp) %>%  autoplot()

Decomposição de diff(log(serie))

dcmp <- crime_estado_ts %>%
  model(STL(hom_doloso_diff_log ~ season(window = Inf)))

components(dcmp) %>%  autoplot()

Modelo autoregressivo ARIMA

Um modelo muito usado para previsão de séries temporais, que leva em conta e tira proveito da propriedade da autocorrelação, é o ARIMA.

ARIMA é uma junção dos modelos AR (autoregressivo) e MA (“médias móveis”, na verdade soma dos choques prévios). O I (integrated) fica por conta da transformação, que fizemos anteriormente, que pega a diferença entre elementos da série.

No modelo Autoregressivo (AR), a série é representada como um processo dependente dos elementos anteriores da própria série, mais um choque em \(t\). Um AR(p), portanto, é assim:

\[x_t = \phi_1 z_{t-1} + \phi_2 z_{t-2} + ... + \phi_p z_{t-p} + a_t \]

No modelo “Média Móvel” (MA), a série é representada como um processo dependente dos choques anteriores, mais um choque em \(t\). Um MA(q), portanto, é assim:

\[x_t = \theta_1 a_{t-1} + \theta_2 a_{t-2} + ... + \theta_q a_{t-q} + a_t \]

É possível descrever um processo AR com um procersso MA e vice-versa, mas isso exigiria uma quantidade infinita de coeficientes.

Por parcimônia, pois os coeficientes precisam ser estimados e cada estimativa eleva a incerteza, é comum conjugar os dois modelos em um ARMA(p,q):

\[x_t = \phi_1 z_{t-1} + \phi_2 z_{t-2} + ... + \phi_p z_{t-p} + \theta_1 a_{t-1} + \theta_2 a_{t-2} + ... + \theta_q a_{t-q} + a_t \]

Ao usar o ARMA na diferença entre elementos da série, ele vira ARIMA(p,d,q), onde o d é o número de vezes que se diferencia a série.

É possível adicionar P + Q termos autoregressivos e de média móvel referentes a multiplos do lag que representa o período m da série.

No caso de uma série mensal com sazonalidade anual, seriam adicionados termos AR e MA relativos aos lags 12, 24, 36, … e a diferenciação poderia ser feita no lag 12.

Alguns chamam este modelo com sazonalidade de SARIMA(p, d, q)(P, D, Q)[m]

Estimando um ARIMA com fable

A biblioteca fable ajuda a estimar alguns modelos usados para previsão de séries temporais.

Um dos modelos disponíveis é o SARIMA.

A especificação do modelo é escolhida de forma a obter um modelo parcimonioso. A métrica de desempenho usada (AIC) penaliza os modelos com mais coeficientes, portanto maior soma de p, q, P e Q.

O d e D são escolhidos de forma a serem os menores em que a série se apresenta estacionária.

Note que usamos o log, pois acreditamos que a série sofre alterações sempre relativas ao seu patamar anterior e não absolutas.

modelo <- crime_estado_ts %>%
  model(arima = ARIMA(log(hom_doloso) ~ 0 )) 

modelo %>%
  report()
## Series: hom_doloso 
## Model: ARIMA(1,1,2)(0,0,2)[12] 
## Transformation: log(hom_doloso) 
## 
## Coefficients:
##          ar1     ma1     ma2    sma1    sma2
##       0.8257  -1.171  0.2042  0.2223  0.2101
## s.e.  0.0888   0.126  0.1070  0.0774  0.0609
## 
## sigma^2 estimated as 0.008022:  log likelihood=195.44
## AIC=-378.89   AICc=-378.44   BIC=-359.25

A função accuracy mostra algumas métricas de perfomance NO CONJUNTO DE TREINAMENTO

modelo %>%
  accuracy()
## # A tibble: 1 x 10
##   .model .type       ME  RMSE   MAE   MPE  MAPE  MASE RMSSE    ACF1
##   <chr>  <chr>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
## 1 arima  Training -2.55  38.9  31.6 -1.22  7.27 0.587 0.587 0.00749

Projetando valores usando o modelo estimado

A função forecast da fabletools nos permite extrair as previsões do modelo para um horizonte.

Veja que os números já são transformados de volta para a unidade original da série

previsao <- modelo %>% 
  forecast(
    h = "1 year"
  ) 


previsao
## # A fable: 12 x 4 [1M]
## # Key:     .model [1]
##    .model     data       hom_doloso .mean
##    <chr>     <mth>           <dist> <dbl>
##  1 arima  2019 mai t(N(5.9, 0.008))  358.
##  2 arima  2019 jun t(N(5.9, 0.011))  350.
##  3 arima  2019 jul t(N(5.9, 0.014))  360.
##  4 arima  2019 ago t(N(5.9, 0.016))  355.
##  5 arima  2019 set t(N(5.9, 0.018))  368.
##  6 arima  2019 out t(N(5.9, 0.019))  373.
##  7 arima  2019 nov  t(N(5.9, 0.02))  370.
##  8 arima  2019 dez t(N(5.9, 0.021))  364.
##  9 arima  2020 jan t(N(5.9, 0.022))  377.
## 10 arima  2020 fev t(N(5.9, 0.023))  356.
## 11 arima  2020 mar t(N(5.9, 0.023))  372.
## 12 arima  2020 abr t(N(5.9, 0.024))  375.

Existe uma função de autoplot(), da biblioteca feasts que mostra um gráfico incluindo o intervalo de confiança.

previsao %>% 
  autoplot(
    crime_estado_ts
  ) +
  theme_minimal()

Adicionando regressores adicionais ao ARIMA

É possível adicionar outros regressores além dos termos relacionados ao ARIMA.

Imagina que acreditemos que o calor faz as pessoas ficarem mais nervosas a ponto de aumentar o número de homicídios.

Vamos pegar os dados de temperatura de uma estação na cidade do Rio de janeiro como proxy da temperatura do estado.

estacao_rio <- worldmet::getMeta(country = "BR")



dados_temp <- worldmet::importNOAA(code = "837460-99999", year = 2003:2020, hourly = TRUE)

write_rds(dados_temp, "dados/crime_rio/temperatura.rds")

Vamos usar o percentil 67 da temperatura horária a cada mês

dados_temp <- read_rds("dados/crime_rio/temperatura.rds")

dados_temperatura_mensais_normais <- dados_temp %>% 
  mutate(
    mes = month(date)
  ) %>% 
  group_by(
    mes
  ) %>% 
  summarise(
    p67_temp_normal = quantile(air_temp, probs = .5, na.rm = TRUE)
  )



dados_temperatura_mensais <- dados_temp %>% 
  mutate(
    ano_mes = yearmonth(date),
    mes = month(date)
  ) %>% 
  group_by(
    ano_mes,
    mes
  ) %>% 
  summarise(
    p67_temp = quantile(air_temp, probs = .67, na.rm = TRUE)
  ) %>%  
  left_join(
    dados_temperatura_mensais_normais, by = c("mes")
  ) %>% 
  mutate(
    desvio_temperatura = p67_temp - p67_temp_normal
  ) 

E juntar aos dados que já tínhamos

crime_estado_ts <- crime_estado_ts %>% 
  left_join(
    dados_temperatura_mensais,
    by = c("data" = "ano_mes")
  )

É possível rodar o modelo com essa variável.

O resultado estima que a cada grau de temperatura adicional, observamos 0,02 a mais no log da diferença da série (veja que d=1). Lembrando da equivalência entre log e delta percentual, podemos dizer que observamos aproximadamente 2% de aumento nos homicídios a cada grau de temperatura adicional.

modelo_temp <- crime_estado_ts %>% 
  model(arima = ARIMA(log(hom_doloso) ~  p67_temp )) 

modelo_temp %>%
  report()
## Series: hom_doloso 
## Model: LM w/ ARIMA(0,1,1)(0,0,2)[12] errors 
## Transformation: log(hom_doloso) 
## 
## Coefficients:
##           ma1    sma1    sma2  p67_temp
##       -0.4882  0.1066  0.1446    0.0218
## s.e.   0.0682  0.0759  0.0622    0.0034
## 
## sigma^2 estimated as 0.007119:  log likelihood=207.02
## AIC=-404.04   AICc=-403.73   BIC=-387.68

O modelo oferece uma performance um pouco melhor, também, DENTRO DA AMOSTRA

modelo_temp %>%  accuracy()
## # A tibble: 1 x 10
##   .model .type       ME  RMSE   MAE    MPE  MAPE  MASE RMSSE   ACF1
##   <chr>  <chr>    <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>  <dbl>
## 1 arima  Training -1.04  37.5  30.1 -0.816  6.85 0.559 0.566 0.0412

Mas não sabemos o valor da variável exógena!

É claro que isso é um modelo inexequível, pois não sabemos qual será a temperatura nos próximos meses, mas poderíamos traçar cenários ou usar um outro modelo para prever essa variável.

Aqui mostramos dois cenários, com o percentil 67 da temperatura em 26 e em 29

previsao_29 <- forecast(
  modelo_temp, 
  h = 10, 
  new_data = 
    tsibble(
      data = 
        seq.Date(
          from = max(crime_estado_ts$data %>% as_date() %m+% months(1)), 
          length.out = 12, 
          by = "1 month") %>%  yearmonth(),
      p67_temp = replicate(expr = 29, n = 12),
      index = data
    )
)

previsao_26 <- forecast(
  modelo_temp, 
  h = 10, 
  new_data = 
    tsibble(
      data = 
        seq.Date(
          from = max(crime_estado_ts$data %>% as_date() %m+% months(1)), 
          length.out = 12, 
          by = "1 month") %>%  yearmonth(),
      p67_temp = replicate(expr = 26, n = 12),
      index = data
    )
)
ggplot() +
  geom_line(
    data = crime_estado_ts,
    aes(
      x = data,
      y = hom_doloso
    )
  ) +
  geom_point(
    data = crime_estado_ts,
    aes(
      x = data,
      y = hom_doloso
    )
  ) +
  geom_line(
    data = previsao_29,
    aes(
      x = data,
      y = .mean
    ),
    color = "red"
  ) +
  geom_point(
    data = previsao_29,
    aes(
      x = data,
      y = .mean
    ),
    color = "red"
  ) +
    geom_line(
    data = previsao_26,
    aes(
      x = data,
      y = .mean
    ),
    color = "blue"
  ) +
  geom_point(
    data = previsao_26,
    aes(
      x = data,
      y = .mean
    ),
    color = "blue"
  ) 

Dividindo os dados em treinamento e validação

Assim como fizemos nos modelos que não eram de séries temporais, podemos dividir os dados em treinamento em teste.

Mas no caso das sérier, precisamos manter a ordem cronológica de cada um deles.

Podemos escolher uma data específica e separar os dados anteriores para treinamento e os posteriores para teste.

Primeiro o modelo sem temperatura

modelo_puro <- function(dado){
  dado %>% 
  model(arima = ARIMA(log(hom_doloso) ))
}
  

crime_estado_ts_treino <-  crime_estado_ts %>% 
  filter(
    data < yearmonth("20190101")
  )

fit_treino <- crime_estado_ts_treino %>% 
  modelo_puro

previsao <- fit_treino %>% 
  forecast(
    crime_estado_ts %>% filter(data >= yearmonth("20190101"))
  )

accuracy(previsao, crime_estado_ts)
## # A tibble: 1 x 10
##   .model .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE   ACF1
##   <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
## 1 arima  Test  -31.3  43.3  39.5 -9.50  11.6 0.754 0.673 -0.251

Depois o modelo com temperatura, que se sai melhor

modelo_com_temp <- function(dado){
  dado %>% 
  model(arima = ARIMA(log(hom_doloso) ~ p67_temp ))
}
  

crime_estado_ts_treino <-  crime_estado_ts %>% 
  filter(
    data < yearmonth("20190101")
  )

fit_treino <- crime_estado_ts_treino %>% 
  modelo_com_temp

previsao <- fit_treino %>% 
  forecast(
    crime_estado_ts %>% filter(data >= yearmonth("20190101"))
  )

accuracy(previsao, crime_estado_ts)
## # A tibble: 1 x 10
##   .model .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE   ACF1
##   <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
## 1 arima  Test  -30.7  36.4  30.7 -9.13  9.13 0.586 0.566 -0.292

Validação rolling

Podemos rolar essa data de corte entre dados de treinamento e dados de validação, criando uma série com dados projetados sempre ao mesmo horizonte

Na imagem abaixo, o esquema é mostrado: a cada rodada escolhemos uma data de corte, que vai rolando, deixamos os dados anteriores como conjunto de treinamento e o dado imediatamente posterior como dado de validação (poderíamos escolher o dado sempre 6 meses posterior à data de corte, por exemplo se quiséssemos avaliar a performance neste horizonte).

Abaixo implementamos uma função que deve ser chamada a cada data de corte escolhida

previsao_fora_amostra <- function(modelo, data_corte, dado){
  
  dado_treino <-  dado %>% 
  filter(
    data < yearmonth(data_corte)
  )

  fit_treino <- dado_treino %>% 
    modelo
  
  previsao <- fit_treino %>% 
    forecast(
      dado %>% filter(data >= yearmonth(data_corte)) %>% slice_min(1)
    )
  

  previsao %>% 
    select(
      data, .mean
    ) %>% 
    as_tibble() %>% 
    select(
        data, .mean
    )    

}

Realizamos o procedimento para o modelo simples, sem temperatura

cortes <- seq.Date(
  from = make_date(2013,1,1),
  to = max(crime_estado_ts$data) %>% as.Date(),
  by = "1 month"
)

plan(multiprocess)

previsoes_fora_1_puro <- cortes %>% 
  future_map_dfr(
    .f = ~previsao_fora_amostra(modelo = modelo_puro, data_corte = .x, dado = crime_estado_ts),
    .progress = TRUE
  )

previsoes_fora_1_puro_ts <- previsoes_fora_1_puro %>% 
  mutate(
    hom_doloso = .mean
  ) %>% 
  as_tsibble(
    index = data
  )

dados_reais <- crime_estado_ts %>% 
  filter(
    data %in% (cortes %>% yearmonth())
  )


performance_puro <- accuracy(previsoes_fora_1_puro_ts$hom_doloso, dados_reais$hom_doloso)

write_rds(performance_puro, "dados/crime_rio/performance_puro.rds")

E para o modelo com temperatura

plan(multiprocess)

previsoes_fora_1_com_temp <- cortes %>% 
  future_map_dfr(
    .f = ~previsao_fora_amostra(modelo = modelo_com_temp, data_corte = .x, dado = crime_estado_ts),
    .progress = TRUE
  )

previsoes_fora_1_com_temp_ts <- previsoes_fora_1_com_temp %>% 
  mutate(
    hom_doloso = .mean
  ) %>% 
  as_tsibble(
    index = data
  )

dados_reais_com_temp <- crime_estado_ts %>% 
  filter(
    data %in% (cortes %>% yearmonth())
  )

performance_com_temp <- accuracy(previsoes_fora_1_com_temp_ts$hom_doloso, dados_reais_com_temp$hom_doloso)

write_rds(performance_com_temp, "dados/crime_rio/performance_com_temp.rds")

O modelo com temperatura se mostrou melhor.

read_rds("dados/crime_rio/performance_puro.rds")
##                 ME     RMSE      MAE        MPE     MAPE
## Test set 0.8225141 37.14884 31.30003 -0.5059692 8.020367
read_rds("dados/crime_rio/performance_com_temp.rds")
##                 ME     RMSE      MAE        MPE     MAPE
## Test set 0.1528492 33.96707 27.68021 -0.5687739 7.140369

Outros modelos para comparação: NAIVE

A biblioteca fable disponibiliza outros modelos, alguns deles bem simples.

É uma boa prática realizar o mesmo teste com estes modelos simples, de forma a identificar se realmente o modelo mais complexo está adicionando valor.

O modelo Naive com sazonalidade repete o valor do período anterior (considerando a sazonalidade) e adiciona um drift constante estimado para a série.

modelo_bobo <- function(dado){
  dado %>% 
    model(snaive = SNAIVE(log(hom_doloso) ~ drift() + lag("year")))
}



plan(multiprocess)

previsoes_fora_1_bobo <- cortes %>% 
  future_map_dfr(
    .f = ~previsao_fora_amostra(modelo = modelo_bobo, data_corte = .x, dado = crime_estado_ts),
    .progress = TRUE
  )

previsoes_fora_1_bobo_ts <- previsoes_fora_1_bobo %>% 
  mutate(
    hom_doloso = .mean
  ) %>% 
  as_tsibble(
    index = data
  )


performance_bobo <- accuracy(previsoes_fora_1_bobo_ts$hom_doloso, dados_reais_com_temp$hom_doloso)



write_rds(performance_bobo, "dados/crime_rio/performance_bobo")

O resultado é pior

read_rds("dados/crime_rio/performance_bobo")
##                ME     RMSE      MAE      MPE   MAPE
## Test set 11.79164 71.99037 59.94568 1.559055 15.201

Outros modelos para comparação: TSLM

Este modelo é estimado com uma tendência e a sazonalidade

modelo_decomp <- function(dado){
  dado %>% 
    model(lm = TSLM(log(hom_doloso) ~ trend() + season()))
}



plan(multiprocess)

previsoes_fora_1_decomp <- cortes %>% 
  future_map_dfr(
    .f = ~previsao_fora_amostra(modelo = modelo_decomp, data_corte = .x, dado = crime_estado_ts),
    .progress = TRUE
  )

previsoes_fora_1_decomp_ts <- previsoes_fora_1_decomp %>% 
  mutate(
    hom_doloso = .mean
  ) %>% 
  as_tsibble(
    index = data
  )


performance_decomp <- accuracy(previsoes_fora_1_decomp_ts$hom_doloso, dados_reais_com_temp$hom_doloso)



write_rds(performance_decomp, "dados/crime_rio/performance_tslm")

O erro foi similar ao do modelo NAIVE

read_rds("dados/crime_rio/performance_tslm")
##                ME     RMSE      MAE      MPE    MAPE
## Test set 50.96311 70.38793 60.22174 11.43432 14.2646